Dataset Loading

load-atoms can download named datasets from the internet. The DatabaseEntry class defines the metadata schema for these named datasets. Each dataset's metadata is serialised as a .yaml file and hosted in our repo on GitHub.

Loading a dataset by name is handled by load_dataset_by_id(). The first call to this function for a given dataset triggers the following steps:

1. We download the associated DatabaseEntry file to root/database-entries/name.yaml.
2. We check if the dataset is compatible with the current version of load-atoms.
3. If the dataset hasn’t been cached yet:
   a. We download and execute a dataset-specific importer script.
   b. The importer is responsible for downloading and processing the raw data files.
   c. The importer returns an AtomsDataset object.
4. We cache the AtomsDataset object to root/name.pkl or root/name.lmdb.
5. We display usage information, including license and citation details.
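
The compatibility check in step 2 amounts to comparing the entry's minimum_load_atoms_version against the installed version. The following is a minimal sketch of one way to do that, assuming plain dotted numeric version strings (pre-release tags such as "1.0rc1" are not handled). The helper names parse_version and is_compatible are hypothetical and not part of load-atoms, which may implement this check differently.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version such as "0.2.10" into (0, 2, 10)."""
    return tuple(int(part) for part in version.split("."))


def _pad(version: Tuple[int, ...], width: int) -> Tuple[int, ...]:
    """Right-pad with zeros so that "0.2" and "0.2.0" compare as equal."""
    return version + (0,) * (width - len(version))


def is_compatible(installed: str, minimum: Optional[str]) -> bool:
    """Return True if `installed` satisfies the entry's minimum version.

    A minimum of None means the entry places no constraint on the version.
    """
    if minimum is None:
        return True
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return _pad(a, width) >= _pad(b, width)


print(is_compatible("0.2.10", "0.2.9"))  # True: compared numerically, not as text
print(is_compatible("0.1.0", "0.2.0"))   # False
print(is_compatible("0.1.0", None))      # True
```

Comparing integer tuples rather than raw strings avoids the pitfall where "0.2.10" sorts before "0.2.9" lexicographically.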

Subsequent calls to load_dataset_by_id() with the same dataset name will simply read the cached AtomsDataset object from disk. This usually takes less than 1 second.
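
The cache-on-first-call behaviour described above can be illustrated with a small, self-contained sketch. It is not the load-atoms implementation: it skips the metadata download, the version check and the usage message, uses a plain list in place of an AtomsDataset, and only covers the pickle case. The names load_cached and toy_importer are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, List


def load_cached(name: str, root: Path, importer: Callable[[], Any]) -> Any:
    """Return the dataset called `name`, running `importer` only on a cache miss."""
    cache_file = root / f"{name}.pkl"
    if cache_file.exists():
        # Subsequent calls: read the cached object straight from disk.
        with open(cache_file, "rb") as f:
            return pickle.load(f)

    # First call: build the dataset, then cache it for next time.
    dataset = importer()
    root.mkdir(parents=True, exist_ok=True)
    with open(cache_file, "wb") as f:
        pickle.dump(dataset, f)
    return dataset


calls: List[str] = []


def toy_importer() -> List[str]:
    """Hypothetical importer: records that it ran and returns toy data."""
    calls.append("ran")
    return ["structure-0", "structure-1"]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = load_cached("toy", root, toy_importer)   # runs the importer
    second = load_cached("toy", root, toy_importer)  # served from root/toy.pkl
    cache_exists = (root / "toy.pkl").exists()

print(first == second, len(calls), cache_exists)  # True 1 True
```

The importer runs exactly once; the second call returns an equal object without touching it.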

load_atoms.database.backend.load_dataset_by_id(dataset_id: str, root: Path) → AtomsDataset

Load the AtomsDataset and corresponding DatabaseEntry for the given dataset id, saving the dataset to the given root directory.

Parameters:

dataset_id – The id of the dataset to load.
root – The root folder to save the structures to.

class load_atoms.database.DatabaseEntry

Holds all the required metadata for a named dataset, such that it can be automatically downloaded using load_dataset(), and so that documentation can be automatically generated.

name: str – The name of the dataset

year: int – The year the dataset was created

description: str – A description of the dataset (in .rst format)

category: str – The category of the dataset (e.g. "Potential Fitting", "Benchmarks")

format: Literal['lmdb', 'memory'] – The format of the dataset

minimum_load_atoms_version: str | None – The minimum version of load-atoms that is required to load the dataset.

citation: str | None – A citation for the dataset (in BibTeX format)

license: str | None – The license identifier of the dataset (e.g. "CC BY-NC-SA 4.0")

representative_structure: int | None – The index of a representative structure (for visualisation purposes)

per_atom_properties: Dict[str, PropertyDescription] | None – A mapping from per-atom properties to their descriptions

per_structure_properties: Dict[str, PropertyDescription] | None – A mapping from per-structure properties to their descriptions
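
To make the shape of an entry concrete, the fields above can be mirrored in a plain dataclass. This is only an illustrative sketch: the real DatabaseEntry may be built on a different base and validate its input. The class name DatabaseEntrySketch, the None defaults for optional fields, the single desc field on PropertyDescription, and all example values are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional


@dataclass
class PropertyDescription:
    # Hypothetical stand-in: the real class's fields are not listed in this section.
    desc: str


@dataclass
class DatabaseEntrySketch:
    # Required metadata
    name: str
    year: int
    description: str
    category: str
    format: Literal["lmdb", "memory"]
    # Optional metadata (defaulting to None is an assumption of this sketch)
    minimum_load_atoms_version: Optional[str] = None
    citation: Optional[str] = None
    license: Optional[str] = None
    representative_structure: Optional[int] = None
    per_atom_properties: Optional[Dict[str, PropertyDescription]] = None
    per_structure_properties: Optional[Dict[str, PropertyDescription]] = None


# Made-up values, roughly as they might look once a .yaml entry has been parsed.
raw = {
    "name": "toy-dataset",
    "year": 2024,
    "description": "A toy dataset used only for illustration.",
    "category": "Benchmarks",
    "format": "memory",
    "license": "CC BY-NC-SA 4.0",
    "per_atom_properties": {
        "forces": PropertyDescription(desc="Force vector on each atom"),
    },
}

entry = DatabaseEntrySketch(**raw)
print(entry.name, entry.format, entry.citation)  # toy-dataset memory None
```

Note that a dataclass does not enforce the Literal annotation at runtime; a schema that rejects unknown format values would need explicit validation.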