simple-data-analysis
Easy-to-use and high-performance JavaScript library for data analysis.
Simple data analysis (SDA) in JavaScript
This repository is maintained by Nael Shiab, computational journalist and senior data producer for CBC News.
To install with NPM:
```
npm i simple-data-analysis
```
The documentation is available here.
Core principles
The project's goals are:
-   To offer a high-performance and convenient solution in JavaScript for data analysis. It's based on DuckDB and inspired by Pandas (Python) and the Tidyverse (R).
-   To standardize and accelerate frontend/backend workflows with a simple-to-use library working both in the browser and with NodeJS (and similar runtimes).
-   To ease the way for non-coders (especially journalists and web developers) into the beautiful world of data analysis and data visualization in JavaScript.
SDA is based on duckdb-node and duckdb-wasm. DuckDB is a high-performance analytical database system. Under the hood, SDA sends SQL queries to be executed by DuckDB.
You also have the flexibility of writing your own queries if you want to (check the customQuery method) or to use JavaScript to process your data (check the updateWithJS method).
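To make the idea of "sending SQL queries to DuckDB" more concrete, here is a minimal sketch of how a method call could be turned into SQL text. The `buildAddColumnQuery` helper is hypothetical and is not part of SDA's API; the real library may generate different statements. It only builds a string and does not connect to DuckDB.

```javascript
// Hypothetical helper (an assumption for illustration, not SDA's internals):
// it turns a table name, a column name, a type and an SQL expression
// into the SQL text that could add and fill a new column.
function buildAddColumnQuery(table, column, type, definition) {
    return [
        `ALTER TABLE ${table} ADD COLUMN ${column} ${type.toUpperCase()};`,
        `UPDATE ${table} SET ${column} = ${definition};`,
    ].join("\n")
}

// Example: computing a decade from a date column named "time".
const query = buildAddColumnQuery(
    "dailyTemperatures",
    "decade",
    "integer",
    "FLOOR(YEAR(time)/10)*10"
)

console.log(query)
```

The appeal of this design is that the heavy lifting stays inside the database engine, while the JavaScript side only describes what should be computed.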
Feel free to start a conversation or open an issue. Check how you can contribute.
About v2
Because v1.x.x versions weren't based on DuckDB, v2.0.1 is a complete rewrite of the library with many breaking changes.
To test and compare the performance of simple-data-analysis@2.0.1, we calculated the average temperature per decade and city with the daily temperatures from the Adjusted and Homogenized Canadian Climate Data. See this repository for the code.
We ran the same calculations with simple-data-analysis@1.8.1 (both NodeJS and Bun), Pandas (Python), and the Tidyverse (R).
In each script, we:
1. Loaded a CSV file (_Importing_)
2. Selected four columns, removed rows with missing temperature, converted date strings to date and temperature strings to float (_Cleaning_)
3. Added a new column _decade_ computed from each date (_Modifying_)
4. Calculated the average temperature per decade and city (_Summarizing_)
5. Wrote the cleaned-up data that the averages were computed from to a new CSV file (_Writing_)
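The cleaning, modifying, and summarizing steps above can be sketched in plain JavaScript on a tiny in-memory dataset. This is only an illustration of the calculations, not the benchmark code and not how SDA works internally (SDA delegates the work to DuckDB). The sample rows, column names, and the `meanPerDecadeAndCity` function are made up for the example.

```javascript
// Hypothetical rows standing in for a parsed CSV file (values are made up).
const rows = [
    { time: "1995-01-01", city: "A", t: "-10.0" },
    { time: "1996-01-01", city: "A", t: "-12.0" },
    { time: "2003-01-01", city: "A", t: "" }, // missing temperature
    { time: "2004-01-01", city: "A", t: "-8.0" },
    { time: "1995-07-01", city: "B", t: "20.0" },
]

// Cleaning: drop rows without a temperature,
// then convert strings to dates and floats.
const cleaned = rows
    .filter((row) => row.t !== "")
    .map((row) => ({
        time: new Date(row.time),
        city: row.city,
        t: parseFloat(row.t),
    }))

// Modifying: add a decade column computed from the year.
const withDecade = cleaned.map((row) => ({
    ...row,
    decade: Math.floor(row.time.getUTCFullYear() / 10) * 10,
}))

// Summarizing: average temperature per decade and city.
function meanPerDecadeAndCity(data) {
    const groups = new Map()
    for (const { decade, city, t } of data) {
        const key = `${decade}|${city}`
        const group = groups.get(key) ?? { decade, city, sum: 0, count: 0 }
        group.sum += t
        group.count += 1
        groups.set(key, group)
    }
    return [...groups.values()].map(({ decade, city, sum, count }) => ({
        decade,
        city,
        mean: sum / count,
    }))
}

const summary = meanPerDecadeAndCity(withDecade)
console.log(summary)
```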
Each script has been run ten times on a MacBook Pro (Apple M1 Pro / 16 GB), and the durations have been averaged.
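A timing loop of this kind could look like the sketch below. It is an assumption about the general approach, not the code used for these benchmarks: `averageDuration` and `placeholderTask` are hypothetical names, and the placeholder workload simply stands in for a real data-processing script.

```javascript
// Runs `task` a given number of times and returns each duration
// plus the mean duration, all in milliseconds.
async function averageDuration(task, runs = 10) {
    const durations = []
    for (let i = 0; i < runs; i++) {
        const start = performance.now()
        await task() // works for both sync and async tasks
        durations.push(performance.now() - start)
    }
    const total = durations.reduce((sum, d) => sum + d, 0)
    return { durations, mean: total / runs }
}

// Hypothetical workload standing in for a real script.
function placeholderTask() {
    let total = 0
    for (let i = 0; i < 100000; i++) total += Math.sqrt(i)
    return total
}

const benchmark = averageDuration(placeholderTask, 10)
benchmark.then(({ mean }) =>
    console.log(`Mean duration: ${mean.toFixed(3)} ms`)
)
```

Averaging several runs smooths out one-off variations such as caching or background activity on the machine.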
The charts displayed below come from this Observable notebook.
Small file
With _ahccd-samples.csv_:
-   74.7 MB
-   19 cities
-   20 columns
-   971,804 rows
-   19,436,080 data points
As we can see, simple-data-analysis@1.8.1 was the slowest, but simple-data-analysis@2.0.1 is now the fastest.
A chart showing the processing duration of multiple scripts in various languages
Big file
With _ahccd.csv_:
-   1.7 GB
-   773 cities
-   20 columns
-   22,051,025 rows
-   441,020,500 data points
The file was too big for simple-data-analysis@1.8.1, so it's not included here.
Again, simple-data-analysis@2.0.1 is now the fastest option.
A chart showing the processing duration of multiple scripts in various languages
SDA in an Observable notebook
Observable notebooks are great for data analysis in JavaScript. This example shows you how to use simple-data-analysis in one of them.
SDA in an HTML page
If you want to add the library directly to your webpage, you can use an npm-based CDN like jsDelivr.
Here's some code that you can copy and paste into an HTML file. For more methods, check the SimpleDB class documentation.
```html
<script type="module">
    // We import the SimpleDB class from the esm bundle.
    import { SimpleDB } from "https://cdn.jsdelivr.net/npm/simple-data-analysis/+esm"

    async function main() {
        // We start a new instance of SimpleDB.
        const sdb = new SimpleDB()

        // We load daily temperatures for three cities.
        // We put the data in the table dailyTemperatures.
        await sdb.loadData(
            "dailyTemperatures",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/dailyTemperatures.csv"
        )

        // We compute the decade from each date
        // and put the result in the decade column.
        await sdb.addColumn(
            "dailyTemperatures",
            "decade",
            "integer",
            "FLOOR(YEAR(time)/10)*10" // This is SQL
        )

        // We summarize the data by computing
        // the average daily temperature
        // per decade and per city.
        await sdb.summarize("dailyTemperatures", {
            values: "t",
            categories: ["decade", "id"],
            summaries: "mean",
        })

        // We run linear regressions
        // to check for trends.
        await sdb.linearRegressions("dailyTemperatures", {
            x: "decade",
            y: "mean",
            categories: "id",
            decimals: 4,
        })

        // The dailyTemperatures table does not have
        // the names of the cities, just the ids.
        // We load another file with the names
        // in the table cities.
        await sdb.loadData(
            "cities",
            "https://raw.githubusercontent.com/nshiab/simple-data-analysis/main/test/data/files/cities.csv"
        )

        // We join the two tables based
        // on the ids and put the joined rows
        // in the table results.
        await sdb.join("dailyTemperatures", "cities", "id", "left", "results")

        // We select the columns of interest
        // in the table results.
        await sdb.selectColumns("results", [
            "city",
            "slope",
            "yIntercept",
            "r2",
        ])

        // We log the results table.
        await sdb.logTable("results")

        // We store the data in a variable.
        const results = await sdb.getData("results")
    }

    main()
</script>
```
And here's the table you'll see in your browser's console tab.
