# API Reference

## Using datasets

The primary mechanism for loading datasets is the `dataset` function, coupled with `open()` to open the resulting `DataSet` as some Julia type. In addition, DataSets.jl provides two macros, `@datafunc` and `@datarun`, to help in creating program entry points and running them.
### `DataSets.dataset` — Function

```julia
dataset(name)
dataset(project, name)
```

Returns the `DataSet` with the given `name` from `project`. If omitted, the global data environment `DataSets.PROJECT` will be used.

The `DataSet` is metadata, but to use the actual data in your program you need to use the `open` function to access the `DataSet`'s content as a given Julia type.

**Example**

To open a dataset named `"a_text_file"` and read the whole content as a `String`,

```julia
content = open(String, dataset("a_text_file"))
```

To open the same dataset as an `IO` stream and read only the first line,

```julia
open(IO, dataset("a_text_file")) do io
    line = readline(io)
    @info "The first line is" line
end
```

To open a directory as a browsable tree object,

```julia
open(BlobTree, dataset("a_tree_example"))
```
### `DataSets.@datafunc` — Macro

```julia
@datafunc function f(x::DT=>T, y::DS=>S...)
    ...
end
```

Define the function `f(x::T, y::S, ...)` and add data dispatch rules so that `f(x::DataSet, y::DataSet)` will open datasets matching dataset types `DT`, `DS` as Julia types `T`, `S`.
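For instance, here is a sketch of an entry point which opens a `Blob`-typed dataset as a `String` and a `BlobTree`-typed dataset as a `BlobTree` (the function name and body are illustrative):

```julia
@datafunc function main(x::Blob=>String, y::BlobTree=>BlobTree)
    # `x` arrives as the file's content; `y` as a browsable tree.
    @info "Loaded dataset content" length(x) y
end
```

With this definition in place, `main(dataset("a_text_file"), dataset("a_tree_example"))` opens both datasets and dispatches to the method above.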
### `DataSets.@datarun` — Macro

```julia
@datarun [proj] func(args...)
```

Run `func` with the named `DataSet`s from the list `args`.

**Example**

Load `DataSet`s named `a` and `b`, as defined in Data.toml, and pass them to `f()`:

```julia
proj = DataSets.load_project("Data.toml")
@datarun proj f("a", "b")
```
## Data environment

The global data environment for the session is defined by `DataSets.PROJECT`, which is initialized from the `JULIA_DATASETS_PATH` environment variable. To load a data project from a particular TOML file, use `DataSets.load_project`.
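For example, a minimal sketch (the path is illustrative). Since `DataSets.PROJECT` is created from the environment variable when the package initializes, the variable must be set before `using DataSets`:

```julia
# Illustrative project path; set before loading the package, because
# DataSets.PROJECT is initialized from JULIA_DATASETS_PATH at load time.
ENV["JULIA_DATASETS_PATH"] = "/home/user/myproject/Data.toml"

using DataSets
```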
### `DataSets.PROJECT` — Constant

`DataSets.PROJECT` contains the default global data environment for the Julia process. This is created at initialization from the `JULIA_DATASETS_PATH` environment variable, which is a list of paths (separated by `:`, or `;` on Windows).

In analogy to `Base.LOAD_PATH` and `Base.DEPOT_PATH`, the path components are interpreted as follows:

- `@` means the path of the current active project as returned by `Base.active_project(false)`. This can be useful when you're "doing scripting" and you've got a project-specific Data.toml which resides next to the Project.toml. This only applies to projects which are explicitly set with `julia --project` or `Pkg.activate()`.
- Explicit paths may be either directories or files in Data.toml format. For directories, the filename "Data.toml" is implicitly appended. `expanduser()` is used to expand the user's home directory.
- As in `DEPOT_PATH`, an empty path component means the user's default Julia home directory, `joinpath(homedir(), ".julia", "datasets")`.

This simplified version of the code loading rules (`LOAD_PATH`/`DEPOT_PATH`) is used because it seems unlikely that we'll want data locations to be version-dependent in the same way that code is.

Unlike `LOAD_PATH`, `JULIA_DATASETS_PATH` is represented inside the program as a `StackedDataProject`, and users can add custom projects by defining their own `AbstractDataProject` subtypes.

Additional projects may be added or removed from the stack with `pushfirst!`, `push!` and `empty!`.
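As a sketch of the stack manipulation described above (assuming a Data.toml file exists in the current directory):

```julia
using DataSets

# Load a project from a local Data.toml and make it the first place
# searched by the global data environment.
project = DataSets.load_project("Data.toml")
pushfirst!(DataSets.PROJECT, project)

# Dataset names now visible in the global environment:
@show collect(keys(DataSets.PROJECT))
```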
### `DataSets.load_project` — Function

```julia
load_project(path; auto_update=false)
load_project(config_dict)
```

Load a data project from a system `path` referring to a TOML file. If `auto_update` is true, the returned project will monitor the file for updates and reload when necessary.

Alternatively, create a `DataProject` from an existing dictionary `config_dict`, which should be in the Data.toml format.

See also `load_project!`.
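A short sketch of both forms (assuming a Data.toml file exists in the current directory):

```julia
using TOML, DataSets

# From a file, reloading automatically when the file changes on disk:
proj = DataSets.load_project("Data.toml"; auto_update=true)

# From a pre-parsed dictionary in the same Data.toml format:
config = TOML.parsefile("Data.toml")
proj2 = DataSets.load_project(config)
```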
### `DataSets.load_project!` — Function

```julia
load_project!(path_or_config)
```

Prepends to the default global dataset search stack, `DataSets.PROJECT`.

May be renamed in a future version.
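For example, to make the datasets from a script-local Data.toml visible to subsequent `dataset()` calls (the path is illustrative):

```julia
using DataSets

DataSets.load_project!(joinpath(@__DIR__, "Data.toml"))
```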
## DataSet metadata model

The `DataSet` is a holder for dataset metadata, including the type of the data and the method for access (the storage driver - see Storage Drivers). `DataSet`s are managed in projects which may be stacked together. The library provides several subtypes of `DataSets.AbstractDataProject` for this purpose, which are listed below. (Most users will simply want to configure the global data project via `DataSets.PROJECT`.)
### `DataSets.DataSet` — Type

A `DataSet` is a metadata overlay for data held locally or remotely, which is unopinionated about the underlying storage mechanism.

The data in a `DataSet` has a type which implies an index; the index can be used to partition the data for processing.
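For concreteness, here is a sketch of what a dataset's metadata can look like in Data.toml form. The name, UUID and storage section are illustrative; the storage fields depend on the driver in use:

```toml
data_config_version = 1

[[datasets]]
name = "a_text_file"
uuid = "00000000-0000-0000-0000-000000000000"  # placeholder; use a real UUID

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "data/file.txt"
```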
### `DataSets.AbstractDataProject` — Type

Subtypes of `AbstractDataProject` have the following interface.

Must implement:

- `Base.get(project, dataset_name, default)` — search
- `Base.keys(project)` — get dataset names

Optional:

- `Base.iterate()` — default implementation in terms of `keys` and `get`
- `Base.pairs()` — default implementation in terms of `keys` and `get`
- `Base.haskey()` — default implementation in terms of `get`
- `Base.getindex()` — default implementation in terms of `get`
- `DataSets.project_name()` — returns `nothing` by default

Provided by `AbstractDataProject` (should not be overridden):

- `DataSets.dataset()` — implemented in terms of `get`
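As a sketch of the minimum needed to plug into this interface (the `DictDataProject` type is hypothetical, not part of DataSets.jl):

```julia
using DataSets

# A toy project backed by an in-memory Dict of name => DataSet.
struct DictDataProject <: DataSets.AbstractDataProject
    datasets::Dict{String,DataSets.DataSet}
end

# The two required methods; iterate, pairs, haskey, getindex and
# dataset() all fall back to the default implementations.
Base.get(p::DictDataProject, name::AbstractString, default) =
    get(p.datasets, name, default)
Base.keys(p::DictDataProject) = keys(p.datasets)
```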
### `DataSets.DataProject` — Type

```julia
DataProject
```

A concrete data project is a collection of `DataSet`s with associated names. Names are unique within the project.
### `DataSets.StackedDataProject` — Type

```julia
StackedDataProject()
StackedDataProject(projects)
```

Search stack of `AbstractDataProject`s, where projects are searched from the first to the last element of `projects`.

Additional projects may be added or removed from the stack with `pushfirst!`, `push!` and `empty!`.

See also `DataSets.PROJECT`.
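A small sketch (assuming `overrides` and `defaults` are projects loaded elsewhere, for example via `load_project`):

```julia
# Projects are searched first-to-last, so `overrides` shadows `defaults`
# for any dataset names they have in common.
stack = DataSets.StackedDataProject()
push!(stack, overrides)
push!(stack, defaults)

ds = dataset(stack, "my_dataset")  # hypothetical dataset name
```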
### `DataSets.ActiveDataProject` — Type

Data project based on the location of the current explicitly selected Julia Project.toml, as reported by `Base.active_project(false)`.

Several factors make the implementation a bit complicated:

- The active project may change at any time without warning.
- The active project may be `nothing` when no explicit project is selected.
- There might be no Data.toml for the active project.
- The user can change Data.toml interactively and we'd like that to be reflected within the program.
### `DataSets.TomlFileDataProject` — Type

Data project which automatically updates based on a TOML file on the local filesystem.
## Data Models for files and directories

DataSets provides two built-in data models, `Blob` and `BlobTree`, for accessing file-like and directory-like data respectively. For modifying these, the functions `newfile` and `newdir` can be used, together with `setindex!` for `BlobTree`.
### `DataSets.Blob` — Type

```julia
Blob(root)
Blob(root, relpath)
```

`Blob` represents the location of a collection of unstructured binary data. The location is a path `relpath` relative to some `root` data resource.

A `Blob` can naturally be `open()`ed as a `Vector{UInt8}`, but can also be mapped into the program as an `IO` byte stream, or interpreted as a `String`.

Blobs can be arranged into hierarchies ("directories") via the `BlobTree` type.
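As a sketch, given a `Blob` named `blob` (for example, a leaf of a `BlobTree`), each of the openings mentioned above looks like:

```julia
bytes = open(Vector{UInt8}, blob)   # whole content as raw bytes
text  = open(String, blob)          # whole content decoded as a String
open(IO, blob) do io                # streamed access
    first_line = readline(io)
    @info "First line" first_line
end
```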
### `DataSets.BlobTree` — Type

```julia
BlobTree(root)
```

`BlobTree` is a "directory tree"-like hierarchy which may have `Blob`s and `BlobTree`s as children.

The tree implements the `AbstractTrees.children()` interface and may be indexed with paths to traverse the hierarchy down to the leaves ("files"), which are of type `Blob`. Individual leaves may be `open()`ed as various Julia types.

**Example**

Normally you'd construct these via the `dataset` function, which takes care of constructing the correct `root` object. However, here's a direct demonstration:

```julia
julia> tree = BlobTree(DataSets.FileSystemRoot(dirname(pathof(DataSets))), path"../test/data")
📂 Tree ../test/data @ /home/chris/.julia/dev/DataSets/src
 📁 csvset
 📄 file.txt
 📄 foo.txt
 📄 people.csv.gz

julia> tree["csvset"]
📂 Tree ../test/data/csvset @ /home/chris/.julia/dev/DataSets/src
 📄 1.csv
 📄 2.csv

julia> tree[path"csvset"]
📂 Tree ../test/data/csvset @ /home/chris/.julia/dev/DataSets/src
 📄 1.csv
 📄 2.csv
```
### `DataSets.newfile` — Function

```julia
newfile(func)
newfile(func, ctx)
```

Create a new temporary `Blob` object which may be later assigned to a permanent location in a `BlobTree`. If not assigned to a permanent location, the temporary file is cleaned up during garbage collection.

**Example**

```julia
tree[path"some/demo/path.txt"] = newfile() do io
    println(io, "Hi there!")
end
```
### `DataSets.newdir` — Function

```julia
newdir()
```

Create a new temporary `BlobTree` which can have files assigned into it and may be assigned to a permanent location in a persistent `BlobTree`. If not assigned to a permanent location, the temporary tree is cleaned up during garbage collection.
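A sketch of building up a temporary tree and attaching it to a persistent one (`tree` is assumed to be a writable `BlobTree` obtained elsewhere):

```julia
dir = newdir()
dir[path"hello.txt"] = newfile() do io
    println(io, "hello")
end

# Assigning the temporary tree gives it a permanent location:
tree[path"a/new/subtree"] = dir
```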
## Storage Drivers

To add a new kind of data storage backend, implement `DataSets.add_storage_driver`.
### `DataSets.add_storage_driver` — Function

```julia
add_storage_driver(driver_name=>storage_opener)
```

Associate a `DataSet` storage driver named `driver_name` with `storage_opener`. When a `dataset` with `storage.driver == driver_name` is opened, `storage_opener(user_func, storage_config, dataset)` will be called. Any existing storage driver registered to `driver_name` will be overwritten.

As a matter of convention, `storage_opener` should generally take configuration from `storage_config`, which is just `dataset.storage`. But to avoid config duplication it may also use the content of `dataset` (for example, `dataset.uuid`).

Packages which define new storage drivers should generally call `add_storage_driver()` within their `__init__()` functions.
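As an illustration of this calling convention (not a real driver; the driver name, function and `content` key are all hypothetical), a driver serving each dataset's content from a string stored inline in its storage configuration might look like:

```julia
using DataSets

# `user_func` receives whatever object the driver exposes as the dataset's
# storage; here an in-memory IO stream, for simplicity. A real driver
# would construct a storage object suited to its data model.
function open_inline_storage(user_func, storage_config, dataset)
    user_func(IOBuffer(storage_config["content"]))
end

# Register the driver at package load time, per the convention above.
function __init__()
    DataSets.add_storage_driver("InlineData" => open_inline_storage)
end
```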