Dataset Loading
load-atoms exposes the ability to download named datasets from the internet. The DatabaseEntry class defines a schema for the metadata of such named datasets. This metadata is serialised as a .yaml file for each dataset, and hosted in our repo on GitHub.
Loading a dataset by name is handled within load_dataset_by_id().
Calling this function for the first time will trigger the following steps:
1. We download the associated DatabaseEntry file to root/database-entries/name.yaml.
2. We check if the dataset is compatible with the current version of load-atoms.
3. If the dataset hasn’t been cached yet:
   a. We download and execute a dataset-specific importer script.
   b. The importer is responsible for downloading and processing the raw data files.
   c. The importer returns an AtomsDataset object.
4. We cache the AtomsDataset object to root/name.pkl or root/name.lmdb.
5. We display usage information, including license and citation details.
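The caching behaviour in the steps above can be sketched in a few lines of standard-library Python. This is an illustration only, not the load-atoms implementation: load_or_import and toy_importer are hypothetical names, a plain dict stands in for an AtomsDataset, and only the .pkl cache format is shown.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_import(name: str, root: Path, importer: Callable[[], Any]) -> Any:
    """Return the cached object for `name`, running `importer` only on a cache miss.

    Hypothetical helper: it mirrors the documented flow, not the library's code.
    """
    cache_path = root / f"{name}.pkl"
    if cache_path.exists():
        # Later calls: read the cached object straight from disk.
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # First call: let the importer build the dataset, then cache the result.
    dataset = importer()
    root.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(dataset, f)
    return dataset


calls = []  # records how many times the importer actually runs


def toy_importer() -> dict:
    # Stand-in for a dataset-specific importer script.
    calls.append(1)
    return {"structures": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = load_or_import("toy", root, toy_importer)   # runs the importer
    second = load_or_import("toy", root, toy_importer)  # served from the cache
```

The importer runs once; the second call only unpickles the cached file, which is why repeated loads are fast.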
Subsequent calls to load_dataset_by_id() with the same dataset name will simply read the cached AtomsDataset object from disk. This usually takes less than 1 second.
- load_atoms.database.backend.load_dataset_by_id(dataset_id: str, root: Path) → AtomsDataset
Load the AtomsDataset and corresponding DatabaseEntry for the given dataset id, saving the dataset to the given root directory.
Parameters:
dataset_id – The id of the dataset to load.
root – The root folder to save the structures to.
- class load_atoms.database.DatabaseEntry
Holds all the required metadata for a named dataset, such that it can be automatically downloaded using load_dataset(), and so that documentation can be automatically generated.
- minimum_load_atoms_version: str | None
The minimum version of load-atoms that is required to load the dataset.
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- representative_structure: int | None
The index of a representative structure (for visualisation purposes).
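As an illustration of how a field like minimum_load_atoms_version might be used for the compatibility check, here is a minimal sketch. It assumes plain dotted numeric versions such as "0.3.0"; parse_version and is_compatible are hypothetical helpers, not part of load-atoms, and the library's actual comparison logic may differ.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    # Assumes purely numeric, dot-separated parts (no "rc1" or "dev" suffixes).
    return tuple(int(part) for part in version.split("."))


def is_compatible(installed: str, minimum: Optional[str]) -> bool:
    """Return True if `installed` satisfies the dataset's minimum version."""
    if minimum is None:
        # No minimum declared: treat any installed version as compatible.
        return True
    # Compare as integer tuples so that "0.10.0" correctly ranks above "0.9.1".
    return parse_version(installed) >= parse_version(minimum)
```

Comparing integer tuples rather than raw strings avoids the lexicographic trap where "0.10.0" sorts below "0.9.1". For versions with pre-release suffixes, the Version class from the third-party packaging library is a more robust choice.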