API Reference

Using datasets

The primary mechanism for loading datasets is the dataset function, coupled with open() to open the resulting DataSet as some Julia type. In addition, DataSets.jl provides two macros, @datafunc and @datarun, to help create program entry points and run them.

DataSets.dataset - Function
dataset(name)
dataset(project, name)

Returns the DataSet with the given name from project. If project is omitted, the global data environment DataSets.PROJECT is used.

The DataSet itself is only metadata; to use the actual data in your program, call open on the DataSet to access its content as a given Julia type.

name is the name of the dataset, or more generally a "data specification": a URI-like object of the form namespace/name?params#fragment.
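As a rough illustration of the namespace/name?params#fragment shape, the following sketch splits such a string into its three parts with a regular expression. The helper split_dataspec_sketch is hypothetical and is not part of DataSets.jl, which may parse data specifications differently.

```julia
# Hypothetical helper for illustration only; not a DataSets.jl function.
# Splits "namespace/name?params#fragment" into its three parts.
function split_dataspec_sketch(spec::AbstractString)
    m = match(r"^([^?#]*)(?:\?([^#]*))?(?:#(.*))?$", spec)
    m === nothing && throw(ArgumentError("not a data specification: $(repr(spec))"))
    name, params, fragment = m.captures
    # Optional parts that are absent are returned as `nothing`.
    tostr(x) = x === nothing ? nothing : String(x)
    return (name = String(name), params = tostr(params), fragment = tostr(fragment))
end

parts = split_dataspec_sketch("namespace/name?version=2#section")
```

A plain dataset name such as "a_text_file" is the simplest case: only the name part is present, and the other two parts come back as nothing.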

Example

To open a dataset named "a_text_file" and read the whole content as a String,

content = open(String, dataset("a_text_file"))

To open the same dataset as an IO stream and read only the first line,

open(IO, dataset("a_text_file")) do io
    line = readline(io)
    @info "The first line is" line
end

To open a directory as a browsable tree object,

open(FileTree, dataset("a_tree_example"))
DataSets.@datafunc - Macro
@datafunc function f(x::DT=>T, y::DS=>S...)
    ...
end

Define the function f(x::T, y::S, ...) and add data dispatch rules so that f(x::DataSet, y::DataSet) will open datasets matching dataset types DT,DS as Julia types T,S.

DataSets.@datarun - Macro
@datarun [proj] func(args...)

Run func with the named DataSets from the list args.

Example

Load DataSets named a,b as defined in Data.toml, and pass them to f().

proj = DataSets.load_project("Data.toml")
@datarun proj f("a", "b")

Data environment

The global data environment for the session is defined by DataSets.PROJECT, which is initialized from the JULIA_DATASETS_PATH environment variable. To load a data project from a particular TOML file, use DataSets.load_project.

DataSets.PROJECT - Constant

DataSets.PROJECT contains the default global data environment for the Julia process. It is created at initialization from the JULIA_DATASETS_PATH environment variable, which is a list of paths (separated by :, or by ; on Windows).

In analogy to Base.LOAD_PATH and Base.DEPOT_PATH, the path components are interpreted as follows:

  • @ means the path of the current active project, as returned by Base.active_project(false). This can be useful for scripting when you have a project-specific Data.toml which resides next to the Project.toml. It only applies to projects which are explicitly set with julia --project or Pkg.activate().
  • Explicit paths may be either directories or files in Data.toml format. For directories, the filename "Data.toml" is implicitly appended. expanduser() is used to expand the user's home directory.
  • As in DEPOT_PATH, an empty path component means the user's default Julia home directory, joinpath(homedir(), ".julia", "datasets")

This simplified version of the code loading rules (LOAD_PATH/DEPOT_PATH) is used because it seems unlikely that we'll want data location to be version-dependent in the same way that code is.
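The three path component rules listed above can be sketched in plain Julia as follows. This is an illustrative model only: resolve_component_sketch is a hypothetical helper, not a DataSets.jl function, and the library's own resolution code may differ in detail (for example, in how it treats the default directory).

```julia
# Hypothetical helper, not DataSets.jl code: maps one JULIA_DATASETS_PATH
# component to the location it refers to, following the three rules above.
function resolve_component_sketch(component::AbstractString;
                                  active_project = Base.active_project(false))
    if component == "@"
        # Data.toml next to the explicitly activated Project.toml, if there is one.
        active_project === nothing && return nothing
        return joinpath(dirname(active_project), "Data.toml")
    elseif isempty(component)
        # Empty component: the default location in the user's Julia home directory.
        return joinpath(homedir(), ".julia", "datasets")
    else
        # Explicit path: expand "~", and append Data.toml for directories.
        path = expanduser(component)
        return isdir(path) ? joinpath(path, "Data.toml") : path
    end
end

# Split a path list on the platform separator (":" or ";" on Windows).
separator = Sys.iswindows() ? ';' : ':'
example_path = join(["@", "", "~/data/MyData.toml"], separator)
locations = [resolve_component_sketch(c; active_project = "/work/proj/Project.toml")
             for c in split(example_path, separator)]
```

The active project is passed in explicitly here so the result does not depend on how the Julia session was started; by default the helper asks Base.active_project(false), which returns nothing when no project was explicitly selected.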

Unlike LOAD_PATH, JULIA_DATASETS_PATH is represented inside the program as a StackedDataProject, and users can add custom projects by defining their own AbstractDataProject subtypes.

Additional projects may be added or removed from the stack with pushfirst!, push! and empty!.

DataSets.load_project - Function
load_project(path; auto_update=false)
load_project(config_dict)

Load a data project from a system path referring to a TOML file. If auto_update is true, the returned project will monitor the file for updates and reload when necessary.

Alternatively, create a DataProject from an existing dictionary config_dict, which should be in the Data.toml format.

See also load_project!.


DataSet metadata model

The DataSet is a holder for dataset metadata, including the type of the data and the method for access (the storage driver; see Storage Drivers). DataSets are managed in projects which may be stacked together. The library provides several subtypes of DataSets.AbstractDataProject for this purpose, which are listed below. (Most users will simply need to configure the global data project via DataSets.PROJECT.)

DataSets.DataSet - Type

A DataSet is a metadata overlay for data held locally or remotely which is unopinionated about the underlying storage mechanism.

The data in a DataSet has a type which implies an index; the index can be used to partition the data for processing.

DataSets.AbstractDataProject - Type

Subtypes of AbstractDataProject have the interface

Must implement:

  • Base.get(project, dataset_name, default) — search
  • Base.keys(project) - get dataset names

Optional:

  • Base.iterate() — default implementation in terms of keys and get
  • Base.pairs() — default implementation in terms of keys and get
  • Base.haskey() — default implementation in terms of get
  • Base.getindex() — default implementation in terms of get
  • DataSets.project_name() — returns nothing by default.

Provided by AbstractDataProject (should not be overridden):

  • DataSets.dataset() - implemented in terms of get
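
A minimal sketch of a custom project type, assuming only the interface listed above. DictProjectSketch is a hypothetical name, and the stored values are placeholder strings so the example stays self-contained; a real project would return DataSet objects from get.

```julia
using DataSets

# Hypothetical project type backed by a Dict. The values are placeholder
# strings for illustration; a real project would store DataSets.
struct DictProjectSketch <: DataSets.AbstractDataProject
    entries::Dict{String,String}
end

# The two required methods.
Base.get(p::DictProjectSketch, name::AbstractString, default) = get(p.entries, name, default)
Base.keys(p::DictProjectSketch) = keys(p.entries)

proj = DictProjectSketch(Dict("a_text_file" => "placeholder for a DataSet"))

# haskey and indexing rely on the default implementations in terms of get.
found = haskey(proj, "a_text_file")
entry = proj["a_text_file"]
```

Only get and keys are written by hand here; the remaining methods are left to the defaults described in the list above.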
DataSets.StackedDataProject - Type
StackedDataProject()
StackedDataProject(projects)

Search stack of AbstractDataProjects. Datasets are looked up by searching projects in order, from its first element to its last.

Additional projects may be added or removed from the stack with pushfirst!, push! and empty!.
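
The search order can be modeled with plain Julia collections. The sketch below is a toy model of first-match-wins lookup, not the library's implementation: the "projects" are ordinary Dicts and stacked_get_sketch is a hypothetical helper.

```julia
# Toy model of first-match-wins lookup; plain Dicts stand in for projects.
function stacked_get_sketch(projects, name, default = nothing)
    for project in projects            # searched from first to last
        haskey(project, name) && return project[name]
    end
    return default
end

stack = [Dict("a" => "a from project 1"),
         Dict("a" => "a from project 2", "b" => "b from project 2")]

first_a = stacked_get_sketch(stack, "a")     # project 1 shadows project 2
only_b  = stacked_get_sketch(stack, "b")     # falls through to project 2

# pushfirst! puts a project at the front, so it takes priority from now on.
pushfirst!(stack, Dict("b" => "b from override"))
new_b = stacked_get_sketch(stack, "b")
```

In this model, pushfirst! adds a project that overrides existing entries, while push! would add a fallback that is consulted last.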

See also DataSets.PROJECT.

DataSets.ActiveDataProject - Type

Data project, based on the location of the current explicitly selected Julia Project.toml, as reported by Base.active_project(false).

Several factors make the implementation a bit complicated:

  • The active project may change at any time without warning
  • The active project may be nothing when no explicit project is selected
  • There might be no Data.toml for the active project
  • The user can change Data.toml interactively and we'd like that to be reflected within the program.

Data Models for files and directories

DataSets provides built-in data models, File and FileTree, for accessing file-like and directory-like data respectively. To modify these, use the functions newfile and newdir.

DataSets.File - Type
File(root)
File(root, relpath)

File represents the location of a collection of unstructured binary data. The location is a path relpath relative to some root data resource.

A File can naturally be open()ed as a Vector{UInt8}, but can also be mapped into the program as an IO byte stream, or interpreted as a String.

Files can be arranged into hierarchies ("directories") via the FileTree type.

DataSets.FileTree - Type
newdir()
FileTree(root)

Create a FileTree, a "directory tree"-like hierarchy which may have Files and FileTrees as children. newdir() creates the tree in a temporary directory on the local filesystem. Alternative roots may be supplied which store the data elsewhere.

The tree implements the AbstractTrees.children() interface and may be indexed with /-separated paths to traverse the hierarchy down to the leaves which are of type File. Individual leaves may be open()ed as various Julia types.

Operations on FileTree

FileTree has a largely dictionary-like interface:

  • List keys (i.e., file and directory names): keys(tree)
  • List key-value pairs: pairs(tree)
  • Query keys: haskey(tree, "path")
  • Traverse the tree: tree["path"], tree["multi/component/path"]
  • Add new content: newdir(tree, "path"), newfile(tree, "path")
  • Delete content: delete!(tree, "path")

Iterating a FileTree yields values (not key-value pairs). This has some benefits, for example broadcasting processing across the files in a directory.

  • Property access
    • isdir(), isfile() - determine whether a child of tree is a directory or file.
    • filesize() — size of File elements in a tree
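
The dictionary-like operations above might be combined as in the following sketch. It uses only functions named in this section and assumes a DataSets.jl version that provides newdir and newfile; the file and directory names are arbitrary.

```julia
using DataSets

tree = DataSets.newdir()                       # temporary tree on the local filesystem
DataSets.newfile(tree, "docs/readme.txt") do io
    println(io, "hello")
end
DataSets.newdir(tree, "scratch")

names = collect(keys(tree))                    # names of the top-level children
has_docs = haskey(tree, "docs")
docs_is_dir = isdir(tree["docs"])

# Read a leaf File back as a String using the do-block form of open.
text = open(String, tree["docs/readme.txt"]) do s
    s
end

delete!(tree, "scratch")
```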

Example

Create a new temporary FileTree via the newdir() function and fill it with files via newfile():

julia> dir = newdir()
       for i = 1:3
           newfile(dir, "$i/a.txt") do io
               println(io, "Content of a")
           end
           newfile(dir, "b-$i.txt") do io
               println(io, "Content of b")
           end
       end
       dir
📂 Tree  @ /tmp/jl_Sp6wMF
 📁 1
 📁 2
 📁 3
 📄 b-1.txt
 📄 b-2.txt
 📄 b-3.txt

Create a FileTree from a local directory with DataSets.from_path():

julia> using Pkg
       open(DataSets.from_path(joinpath(Pkg.dir("DataSets"), "src")))
📂 Tree  @ ~/.julia/dev/DataSets/src
 📄 DataSet.jl
 📄 DataSets.jl
 📄 DataTomlStorage.jl
 ...
DataSets.newfile - Function
newfile(tree, path; overwrite=false)
newfile(tree, path; overwrite=false) do io ...

Create a new file object in the tree at the given path. In the second form, the open file io will be passed to the do block.

newfile()

Create a new file which may be later assigned to a permanent location in a tree. If not assigned to a permanent location, the temporary file is cleaned up during garbage collection.

Example

newfile(tree, "some/demo/path.txt") do io
    println(io, "Hi there!")
end
DataSets.newdir - Function
newdir(tree, path; overwrite=false)

Create a new FileTree ("directory") at tree[path] and return it. If overwrite=true, remove any existing tree before creating the new one.

newdir()

Create a new FileTree in a temporary directory on the local filesystem. If not moved to a permanent location (for example, with some_tree["name"] = newdir()) the temporary tree will be cleaned up during garbage collection.


Storage Drivers

To add a new kind of data storage backend, write a storage opener function and register it with DataSets.add_storage_driver.

DataSets.add_storage_driver - Function
add_storage_driver(driver_name=>storage_opener)

Associate DataSet storage driver named driver_name with storage_opener. When a dataset with storage.driver == driver_name is opened, storage_opener(user_func, storage_config, dataset) will be called. Any existing storage driver registered to driver_name will be overwritten.

As a matter of convention, storage_opener should generally take its configuration from storage_config, which is just dataset.storage. To avoid duplicating configuration, it may also use the content of dataset (for example, dataset.uuid).

Packages which define new storage drivers should generally call add_storage_driver() within their __init__() functions.
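
The registration convention might look like the sketch below. The module name, driver name and opener are all hypothetical. This section does not specify what object the opener should hand to user_func, so passing a temporary FileTree from newdir() is an assumption made only to keep the example self-contained.

```julia
module TempTreeDriverSketch

using DataSets

# Hypothetical storage opener. It ignores the storage location and hands the
# caller a fresh temporary FileTree (an assumption, see the text above); a real
# driver would build its storage object from `storage_config`.
function open_temp_tree(user_func, storage_config, dataset)
    tree = DataSets.newdir()
    return user_func(tree)
end

# Register when the module is loaded, following the convention described above.
function __init__()
    DataSets.add_storage_driver("TempTreeSketch" => open_temp_tree)
end

end # module
```

Registering inside __init__ rather than at the top level of the module means the driver is added each time the package is loaded, including when it is loaded from a precompiled cache.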
