Dataset Loading

load-atoms can download named datasets from the internet. The DatabaseEntry class defines the metadata schema for these datasets. Each dataset's metadata is serialised as a .yaml file and hosted in our repo on GitHub.

Loading a dataset by name is handled within load_dataset_by_id(). Calling this function for the first time will trigger the following steps:

  1. We download the associated DatabaseEntry file to root/database-entries/name.yaml.

  2. We check if the dataset is compatible with the current version of load-atoms.

  3. If the dataset hasn’t been cached yet:

     a. We download and execute a dataset-specific importer script.

     b. The importer is responsible for downloading and processing the raw data files.

     c. The importer returns an AtomsDataset object.

  4. We cache the AtomsDataset object to root/name.pkl or root/name.lmdb.

  5. We display usage information, including license and citation details.

Subsequent calls to load_dataset_by_id() with the same dataset name will simply read the cached AtomsDataset object from disk. This usually takes less than 1 second.
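The cache-then-read behaviour described above can be illustrated with a minimal sketch. This is not the load-atoms implementation: load_cached, toy_importer and the list-of-dicts "dataset" are hypothetical stand-ins, and only the pickle (root/name.pkl) case is shown.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List


def load_cached(name: str, root: Path, importer: Callable[[], List[dict]]) -> List[dict]:
    """Return the dataset `name`, running `importer` only on a cache miss.

    Hypothetical helper for illustration; not part of load-atoms.
    """
    cache_file = root / f"{name}.pkl"
    if cache_file.exists():
        # Fast path: later calls only read the pickled object from disk.
        with cache_file.open("rb") as f:
            return pickle.load(f)

    # Slow path: the first call runs the importer, then writes the cache.
    dataset = importer()
    root.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(dataset, f)
    return dataset


importer_calls = []


def toy_importer() -> List[dict]:
    # Stand-in for a dataset-specific importer script; records each call.
    importer_calls.append("called")
    return [{"id": 0}, {"id": 1}]  # placeholder structures, not real data


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = load_cached("toy", root, toy_importer)   # runs the importer
    second = load_cached("toy", root, toy_importer)  # reads root/toy.pkl
    cache_exists = (root / "toy.pkl").exists()
```

In this sketch the importer runs once, and the second call is served entirely from the file on disk, which is why repeated loads are fast.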

load_atoms.database.backend.load_dataset_by_id(dataset_id: str, root: Path) → AtomsDataset

Load the AtomsDataset and corresponding DatabaseEntry for the given dataset id, saving the dataset to the given root directory.

Parameters:
  • dataset_id – The id of the dataset to load.

  • root – The root folder to save the structures to.

class load_atoms.database.DatabaseEntry

Holds all the required metadata for a named dataset, such that it can be automatically downloaded using load_dataset(), and so that documentation can be automatically generated.

name: str

The name of the dataset

year: int

The year the dataset was created

description: str

A description of the dataset (in .rst format)

category: str

The category of the dataset (e.g. "Potential Fitting", "Benchmarks")

format: Literal['lmdb', 'memory']

The format of the dataset

minimum_load_atoms_version: str | None

The minimum version of load-atoms that is required to load the dataset.

citation: str | None

A citation for the dataset (in BibTeX format)

license: str | None

The license identifier of the dataset (e.g. "CC BY-NC-SA 4.0")

representative_structure: int | None

The index of a representative structure (for visualisation purposes)

per_atom_properties: Dict[str, PropertyDescription] | None

A mapping from per-atom properties to their descriptions

per_structure_properties: Dict[str, PropertyDescription] | None

A mapping from per-structure properties to their descriptions
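To make the schema concrete, the fields above can be mirrored in a small dataclass. This is a sketch under stated assumptions, not the library's class: EntrySketch, version_tuple and is_compatible are hypothetical names, PropertyDescription is replaced by a plain dict, the optional fields are assumed to default to None, and the version check assumes purely numeric dotted versions such as "0.2.0".

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional, Tuple


@dataclass
class EntrySketch:
    """Hypothetical mirror of the documented DatabaseEntry fields."""

    name: str
    year: int
    description: str
    category: str
    format: Literal["lmdb", "memory"]
    minimum_load_atoms_version: Optional[str] = None
    citation: Optional[str] = None
    license: Optional[str] = None
    representative_structure: Optional[int] = None
    # dict stands in for PropertyDescription, which is not defined here.
    per_atom_properties: Optional[Dict[str, dict]] = None
    per_structure_properties: Optional[Dict[str, dict]] = None

    def __post_init__(self) -> None:
        # Type hints are not enforced at runtime, so check the literal here.
        if self.format not in ("lmdb", "memory"):
            raise ValueError(f"unknown format: {self.format!r}")


def version_tuple(version: str) -> Tuple[int, ...]:
    # Assumes numeric dotted versions only (no "rc1" or "dev" suffixes).
    return tuple(int(part) for part in version.split("."))


def is_compatible(entry: EntrySketch, installed_version: str) -> bool:
    """True if `installed_version` meets the entry's minimum version."""
    if entry.minimum_load_atoms_version is None:
        return True
    required = version_tuple(entry.minimum_load_atoms_version)
    return version_tuple(installed_version) >= required


# Placeholder entry: every value below is made up for illustration.
entry = EntrySketch(
    name="toy",
    year=2024,
    description="A placeholder entry.",
    category="Benchmarks",
    format="memory",
    minimum_load_atoms_version="0.2.0",
)

newer_ok = is_compatible(entry, "0.10.1")
older_ok = is_compatible(entry, "0.1.9")
```

Versions are compared as integer tuples rather than strings, because a plain string comparison would rank "0.10.1" below "0.2.0".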