API Reference
Using datasets
The primary mechanism for loading datasets is the `dataset` function, coupled with `open()` to open the resulting `DataSet` as some Julia type. In addition, DataSets.jl provides two macros, `@datafunc` and `@datarun`, to help in creating program entry points and running them.
DataSets.dataset — Function

```julia
dataset(name)
dataset(project, name)
```

Returns the `DataSet` with the given `name` from `project`. If `project` is omitted, the global data environment `DataSets.PROJECT` will be used.

The `DataSet` is metadata, but to use the actual data in your program you need to use the `open` function to access the `DataSet`'s content as a given Julia type.

`name` is the name of the dataset, or more generally a "data specification": a URI-like object of the form `namespace/name?params#fragment`.
Example

To open a dataset named `"a_text_file"` and read the whole content as a `String`:

```julia
content = open(String, dataset("a_text_file"))
```

To open the same dataset as an `IO` stream and read only the first line:

```julia
open(IO, dataset("a_text_file")) do io
    line = readline(io)
    @info "The first line is" line
end
```

To open a directory as a browsable tree object:

```julia
open(FileTree, dataset("a_tree_example"))
```
DataSets.@datafunc — Macro

```julia
@datafunc function f(x::DT=>T, y::DS=>S...)
    ...
end
```

Define the function `f(x::T, y::S, ...)` and add data dispatch rules so that `f(x::DataSet, y::DataSet)` will open datasets matching dataset types `DT, DS` as Julia types `T, S`.
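For illustration, a hedged sketch of one such definition (the function name `report` and the choice of opening a `File`-typed dataset as a `String` are assumptions for this example, not part of the API):

```julia
# Sketch: a dataset of dataset type File is opened as a String before
# being passed to the method body.
@datafunc function report(x::File => String)
    @info "Dataset content length" length(x)
end

# Calling with a DataSet invokes the dispatch rule defined above:
# report(dataset("a_text_file"))
```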
DataSets.@datarun — Macro

```julia
@datarun [proj] func(args...)
```

Run `func` with the named `DataSet`s from the list `args`.

Example

Load `DataSet`s named `a` and `b` as defined in Data.toml, and pass them to `f()`:

```julia
proj = DataSets.load_project("Data.toml")
@datarun proj f("a", "b")
```
Data environment
The global data environment for the session is defined by `DataSets.PROJECT`, which is initialized from the `JULIA_DATASETS_PATH` environment variable. To load a data project from a particular TOML file, use `DataSets.load_project`.
DataSets.PROJECT — Constant

`DataSets.PROJECT` contains the default global data environment for the Julia process. This is created at initialization from the `JULIA_DATASETS_PATH` environment variable, which is a list of paths (separated by `:`, or by `;` on Windows).

In analogy to `Base.LOAD_PATH` and `Base.DEPOT_PATH`, the path components are interpreted as follows:

- `@` means the path of the current active project as returned by `Base.active_project(false)`. This can be useful when you're "doing scripting" and you've got a project-specific Data.toml which resides next to the Project.toml. This only applies to projects which are explicitly set with `julia --project` or `Pkg.activate()`.
- Explicit paths may be either directories or files in Data.toml format. For directories, the filename "Data.toml" is implicitly appended. `expanduser()` is used to expand the user's home directory.
- As in `DEPOT_PATH`, an empty path component means the user's default Julia home directory, `joinpath(homedir(), ".julia", "datasets")`.

This simplified version of the code loading rules (`LOAD_PATH`/`DEPOT_PATH`) is used because it seems unlikely that we'll want data location to be version-dependent in the same way that code is.

Unlike `LOAD_PATH`, `JULIA_DATASETS_PATH` is represented inside the program as a `StackedDataProject`, and users can add custom projects by defining their own `AbstractDataProject` subtypes.

Additional projects may be added or removed from the stack with `pushfirst!`, `push!`, and `empty!`.
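For example, a short sketch of adding an extra project to the front of the global search stack (the file path here is hypothetical):

```julia
using DataSets

# Load a standalone project from a Data.toml file
extra = DataSets.load_project("/home/user/analysis/Data.toml")

# Search it before the projects that came from JULIA_DATASETS_PATH
pushfirst!(DataSets.PROJECT, extra)
```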
DataSets.load_project — Function

```julia
load_project(path; auto_update=false)
load_project(config_dict)
```

Load a data project from a system `path` referring to a TOML file. If `auto_update` is true, the returned project will monitor the file for updates and reload when necessary.

Alternatively, create a `DataProject` from an existing dictionary `config_dict`, which should be in the Data.toml format.

See also `load_project!`.
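For instance, a minimal sketch (the dataset name `"a_text_file"` is just the one from the example above):

```julia
# Reload the project automatically when Data.toml changes on disk
project = DataSets.load_project("Data.toml"; auto_update=true)
content = open(String, dataset(project, "a_text_file"))
```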
DataSets.load_project! — Function

```julia
load_project!(path_or_config)
```

Prepends to the default global dataset search stack, `DataSets.PROJECT`.

May be renamed in a future version.
DataSet metadata model
The `DataSet` is a holder for dataset metadata, including the type of the data and the method for access (the storage driver; see Storage Drivers). `DataSet`s are managed in projects which may be stacked together. The library provides several subtypes of `DataSets.AbstractDataProject` for this purpose, which are listed below. (Most users will simply configure the global data project via `DataSets.PROJECT`.)
DataSets.DataSet — Type

A `DataSet` is a metadata overlay for data held locally or remotely which is unopinionated about the underlying storage mechanism.

The data in a `DataSet` has a type which implies an index; the index can be used to partition the data for processing.
DataSets.AbstractDataProject — Type

Subtypes of `AbstractDataProject` have the following interface.

Must implement:

- `Base.get(project, dataset_name, default)` — search for a dataset by name
- `Base.keys(project)` — get dataset names

Optional:

- `Base.iterate()` — default implementation in terms of `keys` and `get`
- `Base.pairs()` — default implementation in terms of `keys` and `get`
- `Base.haskey()` — default implementation in terms of `get`
- `Base.getindex()` — default implementation in terms of `get`
- `DataSets.project_name()` — returns `nothing` by default

Provided by `AbstractDataProject` (should not be overridden):

- `DataSets.dataset()` — implemented in terms of `get`
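For illustration, a hedged sketch of a custom read-only project backed by a `Dict` (the type name `DictDataProject` is an assumption for this example):

```julia
using DataSets

# Hypothetical: a read-only project holding DataSets in memory
struct DictDataProject <: DataSets.AbstractDataProject
    datasets::Dict{String,DataSets.DataSet}
end

# The two required methods; the rest of the interface falls back to
# the default implementations described above.
Base.get(proj::DictDataProject, name::AbstractString, default) =
    get(proj.datasets, name, default)
Base.keys(proj::DictDataProject) = keys(proj.datasets)
```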
DataSets.DataProject — Type

```julia
DataProject
```

An in-memory collection of DataSets.
DataSets.StackedDataProject — Type

```julia
StackedDataProject()
StackedDataProject(projects)
```

Search stack of `AbstractDataProject`s, where projects are searched from the first to the last element of `projects`.

Additional projects may be added or removed from the stack with `pushfirst!`, `push!` and `empty!`.

See also `DataSets.PROJECT`.
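A short sketch of building a stack by hand (the file paths are hypothetical):

```julia
stack = DataSets.StackedDataProject()

# push! appends a project that is searched after the existing ones;
# pushfirst! would give it precedence instead.
push!(stack, DataSets.load_project("Data.toml"))
push!(stack, DataSets.load_project(joinpath(homedir(), ".julia", "datasets", "Data.toml")))
```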
DataSets.ActiveDataProject — Type

Data project based on the location of the current explicitly-selected Julia Project.toml, as reported by `Base.active_project(false)`.

Several factors make the implementation a bit complicated:

- The active project may change at any time without warning.
- The active project may be `nothing` when no explicit project is selected.
- There might be no Data.toml for the active project.
- The user can change Data.toml interactively, and we'd like that to be reflected within the program.
DataSets.TomlFileDataProject — Type

Data project which automatically updates based on a TOML file on the local filesystem.
Data Models for files and directories
DataSets provides the built-in data models `File` and `FileTree` for accessing file-like and directory-like data respectively. For modifying these, the functions `newfile` and `newdir` can be used.
DataSets.File — Type

```julia
File(root)
File(root, relpath)
```

`File` represents the location of a collection of unstructured binary data. The location is a path `relpath` relative to some `root` data resource.

A `File` can naturally be `open()`ed as a `Vector{UInt8}`, but can also be mapped into the program as an `IO` byte stream or interpreted as a `String`.

Files can be arranged into hierarchies ("directories") via the `FileTree` type.
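For instance, reusing the `"a_text_file"` dataset from the earlier example, the raw bytes behind a `File` can be read with:

```julia
bytes = open(Vector{UInt8}, dataset("a_text_file"))
```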
DataSets.FileTree — Type

```julia
newdir()
FileTree(root)
```

Create a `FileTree`: a "directory tree"-like hierarchy which may have `File`s and `FileTree`s as children. `newdir()` creates the tree in a temporary directory on the local filesystem. Alternative `root`s may be supplied which store the data elsewhere.

The tree implements the `AbstractTrees.children()` interface and may be indexed with `/`-separated paths to traverse the hierarchy down to the leaves, which are of type `File`. Individual leaves may be `open()`ed as various Julia types.
Operations on FileTree

`FileTree` has a largely dictionary-like interface:

- List keys (i.e., file and directory names): `keys(tree)`
- List key-value pairs: `pairs(tree)`
- Query keys: `haskey(tree)`
- Traverse the tree: `tree["path"]`, `tree["multi/component/path"]`
- Add new content: `newdir(tree, "path")`, `newfile(tree, "path")`
- Delete content: `delete!(tree, "path")`

Iteration of `FileTree` iterates values (not key-value pairs). This has some benefits; for example, it allows broadcasting processing across files in a directory.

Property access:

- `isdir()`, `isfile()` — determine whether a child of the tree is a directory or file
- `filesize()` — size of `File` elements in a tree
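A short sketch exercising this interface (paths and contents are illustrative):

```julia
tree = newdir()
newfile(tree, "data/a.txt") do io
    println(io, "hello")
end

haskey(tree, "data")     # true
keys(tree["data"])       # contains "a.txt"
delete!(tree, "data")    # remove the subtree again
```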
Example

Create a new temporary `FileTree` via the `newdir()` function and fill it with files via `newfile()`:

```julia
julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt
```
Create a `FileTree` from a local directory with `DataSets.from_path()`:

```julia
julia> open(DataSets.from_path(joinpath(pkgdir(DataSets), "src")))
📂 Tree @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...
```
DataSets.newfile — Function

```julia
newfile(tree, path; overwrite=false)
newfile(tree, path; overwrite=false) do io ...
```

Create a new file object in the `tree` at the given `path`. In the second form, the open file `io` will be passed to the do block.

```julia
newfile()
```

Create a new file which may later be assigned to a permanent location in a tree. If not assigned to a permanent location, the temporary file is cleaned up during garbage collection.

Example

```julia
newfile(tree, "some/demo/path.txt") do io
    println(io, "Hi there!")
end
```
DataSets.newdir — Function

```julia
newdir(tree, path; overwrite=false)
```

Create a new `FileTree` ("directory") at `tree[path]` and return it. If `overwrite=true`, remove any existing tree before creating the new one.

```julia
newdir()
```

Create a new `FileTree` in a temporary directory on the local filesystem. If not moved to a permanent location (for example, with `some_tree["name"] = newdir()`), the temporary tree will be cleaned up during garbage collection.
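A short sketch of both forms (all names are illustrative):

```julia
some_tree = newdir()                    # free-standing temporary tree
results = newdir(some_tree, "results")  # "directory" within the tree
newfile(results, "out.txt") do io
    println(io, "done")
end

scratch = newdir()
some_tree["archive"] = scratch          # attach a free-standing tree at a path
```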
Storage Drivers
To add a new kind of data storage backend, implement `DataSets.add_storage_driver`.

DataSets.add_storage_driver — Function

```julia
add_storage_driver(driver_name=>storage_opener)
```

Associate the DataSet storage driver named `driver_name` with `storage_opener`. When a `dataset` with `storage.driver == driver_name` is opened, `storage_opener(user_func, storage_config, dataset)` will be called. Any existing storage driver registered to `driver_name` will be overwritten.

As a matter of convention, `storage_opener` should generally take configuration from `storage_config`, which is just `dataset.storage`. But to avoid config duplication it may also use the content of `dataset` (for example, `dataset.uuid`).

Packages which define new storage drivers should generally call `add_storage_driver()` within their `__init__()` functions.
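For illustration, a hedged sketch of a trivial driver (the driver name `"InMemory"`, the module name, and the `"content"` config key are all assumptions; what `user_func` receives is up to the driver, here simply an `IO`):

```julia
module InMemoryDriver

using DataSets

# Hypothetical opener: serve a string stored directly in the dataset's
# storage config as an IO stream.
function open_in_memory(user_func, storage_config, dataset)
    io = IOBuffer(storage_config["content"])
    try
        user_func(io)
    finally
        close(io)
    end
end

function __init__()
    # Register at module load time, as recommended above.
    DataSets.add_storage_driver("InMemory" => open_in_memory)
end

end
```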