JuliaDB can load data that is too big to fit in memory (RAM) as well as run a subset of operations on big tables. In particular, OnlineStats Integration works with
groupreduce for running statistical analyses that traditionally would not be possible!
- Data is loaded into a distributed dataset containing "chunks" that safely fit in memory.
- Data is processed
Distributed.nworkers()chunks at a time (each worker processes a chunk and then moves onto the next chunk).
- Note: This means
Distributed.nworkers() * avg_size_of_chunkwill be in RAM simultaneously.
- Note: This means
- Output data is accumulated in-memory.
The limitations of this processing scheme is that only certain operations work out-of-core:
loadndsparse functions accept the keyword arguments
chunks that specify the directory to save the data into and the number of chunks to be generated from the input files, respectively.
Here's an example:
loadtable(glob("*.csv"), output="bin", chunks=100; kwargs...)
Suppose there are 800
.csv files in the current directory. They will be read into 100 chunks (8 files per chunk). Each worker process will load 8 files into memory, save the chunk into a single binary file in the
bin directory, and move onto the next 8 files.
Distributed.nworkers() * (number_of_csvs / chunks) needs to fit in memory simultaneously.
Once data has been loaded in this way, you can reload the dataset (extremely fast) via
tbl = load("bin")
reduce is the simplest out-of-core operation since it works pair-wise. You can also perform group-by operations with a reducer via
using JuliaDB, OnlineStats x = rand(Bool, 100) y = x + randn(100) t = table((x=x, y=y)) groupreduce(+, t, :x; select = :y)
Table with 2 rows, 2 columns: x + ────────────── false 9.80145 true 55.0632
You can replace the reducer with any
OnlineStat object (see OnlineStats Integration for more details):
groupreduce(Sum(), t, :x; select = :y)
Table with 2 rows, 2 columns: x Sum ──────────────────────────────── false Sum: n=47 | value=9.80145 true Sum: n=53 | value=55.0632
Join to Big Table
join operations have limited out-of-core support. Specifically,
join(bigtable, smalltable; broadcast=:right, how=:inner|:left|:anti)
bigtable can be larger than memory, while
Distributed.nworkers() copies of
smalltable must fit in memory. Note that only
:anti joins are supported (no
:outer joins). In this operation,
smalltable is first broadcast to all processors and
bigtable is joined
Distributed.nworkers() chunks at a time.