Dataset Loading
load-atoms exposes the ability to download named datasets from the internet.
The DatabaseEntry class defines a schema for
the metadata of such named datasets. These are serialised as a .yaml file for each dataset,
and hosted in our repo on GitHub.
Loading a dataset by name is handled within load_dataset_by_id().
Calling this function for the first time will trigger the following steps:
1. We download the associated DatabaseEntry file to root/database-entries/name.yaml.
2. We check if the dataset is compatible with the current version of load-atoms.
3. If the dataset hasn’t been cached yet:
   a. We download and execute a dataset-specific importer script.
   b. The importer is responsible for downloading and processing the raw data files.
   c. The importer returns an AtomsDataset object.
4. We cache the AtomsDataset object to root/name.pkl or root/name.lmdb.
5. We display usage information, including license and citation details.
Subsequent calls to load_dataset_by_id() with the same dataset name will
simply read the cached AtomsDataset object from disk. This usually takes less than 1 second.
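The first-call versus later-call behaviour described above can be illustrated with a minimal sketch. This is not the load-atoms implementation: the function name load_with_cache and the importer callable are hypothetical, only the pickle cache path is modelled (not the .lmdb one), and the download, compatibility check and usage message are left out.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_with_cache(name: str, root: Path, importer: Callable[[], Any]) -> Any:
    """Return the cached object for `name`, running `importer` on a cache miss.

    Hypothetical helper for illustration only; not part of load-atoms.
    """
    cache_file = root / f"{name}.pkl"

    if cache_file.exists():
        # Later calls: read the cached object straight from disk.
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # First call: the importer builds the dataset (in the real library this
    # is where the raw files would be downloaded and processed).
    dataset = importer()

    # Cache the result so that the importer is not run again.
    root.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(dataset, f)
    return dataset
```

Calling load_with_cache twice with the same name and root runs the importer only once; the second call is a plain read from disk, which is why repeated loads are fast.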
- load_atoms.database.backend.load_dataset_by_id(dataset_id: str, root: Path) -> AtomsDataset
Load the AtomsDataset and corresponding DatabaseEntry for the given dataset id, saving the dataset to the given root directory.
- Parameters:
dataset_id – The id of the dataset to load.
root – The root folder to save the structures to.
- class load_atoms.database.DatabaseEntry
Holds all the required metadata for a named dataset, such that it can be automatically downloaded using load_dataset(), and so that documentation can be automatically generated.
- minimum_load_atoms_version: str | None
The minimum version of load-atoms that is required to load the dataset.
- representative_structure: int | None
The index of a representative structure (for visualisation purposes).
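As an illustration of how a field like minimum_load_atoms_version could be used for the compatibility check, here is a small sketch. The helper names parse_version and is_compatible are hypothetical, and the comparison assumes plain dotted numeric versions such as "0.2.1"; the actual check in load-atoms may differ, for example in how it treats pre-release suffixes.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version such as '0.2.1' into (0, 2, 1)."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(current: str, minimum: Optional[str]) -> bool:
    """Return True if no minimum is set, or if `current` is at least `minimum`.

    Hypothetical helper for illustration only; not part of load-atoms.
    """
    if minimum is None:
        # No minimum recorded: any installed version is accepted.
        return True
    return parse_version(current) >= parse_version(minimum)
```

Comparing integer tuples rather than raw strings avoids ordering "0.10.0" before "0.9.0".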