Dataset¶
The main entry point of load-atoms is the load_dataset function:
- load_atoms.load_dataset(thing: str | list[ase.Atoms] | Path, root: str | Path | None = None) → AtomsDataset[source]¶
Load a dataset by name or from a list of structures.
- Parameters:
thing – A dataset id, a list of structures, or a path to a file.
root – The root directory to use when loading a dataset by id. If not provided, the default root directory (~/.load-atoms) will be used.
Examples
Load a dataset by id:
>>> from load_atoms import load_dataset
>>> dataset = load_dataset("QM9")
╭───────────────────────────────── QM9 ─────────────────────────────────╮
│                                                                       │
│   Downloading dsgdb9nsd.xyz.tar.bz2 ━━━━━━━━━━━━━━━━━━━━ 100% 00:09   │
│    Extracting dsgdb9nsd.xyz.tar.bz2 ━━━━━━━━━━━━━━━━━━━━ 100% 00:18   │
│                    Processing files ━━━━━━━━━━━━━━━━━━━━ 100% 00:19   │
│                     Caching to disk ━━━━━━━━━━━━━━━━━━━━ 100% 00:02   │
│                                                                       │
│           The QM9 dataset is covered by the CC0 license.              │
│        Please cite the QM9 dataset if you use it in your work.        │
│        For more information about the QM9 dataset, visit:             │
│                            load-atoms/QM9                             │
╰───────────────────────────────────────────────────────────────────────╯
>>> dataset
QM9:
    structures: 133,885
    atoms: 2,407,753
    species:
        H: 51.09%
        C: 35.16%
        O: 7.81%
        N: 5.80%
        F: 0.14%
    properties:
        per atom: (partial_charges)
        per structure: (
            A, B, C, Cv, G, H, U, U0, alpha,
            frequencies, gap, geometry, homo, inchi,
            index, lumo, mu, r2, smiles, zpve
        )
Optionally save a dataset to an explicit root directory:
>>> load_dataset("QM9", root="./my-datasets")
Wrap a list of structures in a dataset:
>>> from ase import Atoms
>>> load_dataset([Atoms("H2O"), Atoms("H2O2")])
Load a dataset from a file:
>>> load_dataset("path/to/file.xyz")
Note
As of ase==0.3.9, the "energy", "forces", and "stress" special keys are loaded into a SinglePointCalculator object and removed from the .info and .arrays dictionaries on the atoms object. We reverse this process when loading a dataset from file.
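The reversal described in the note amounts to routing each calculator result back to the appropriate dictionary. A minimal sketch, using plain dicts as stand-ins for the ase objects (the per-atom/per-structure split used here is an assumption of this sketch, not the library's exact logic):

```python
# Of the special keys, "forces" is per-atom; "energy" and "stress"
# are per-structure (assumption of this sketch).
PER_ATOM_KEYS = {"forces"}

def restore_calc_results(results: dict, info: dict, arrays: dict) -> None:
    """Move calculator results back into .info / .arrays style dicts."""
    for key, value in results.items():
        target = arrays if key in PER_ATOM_KEYS else info
        target[key] = value

info, arrays = {}, {}
restore_calc_results(
    {"energy": -1.5, "forces": [[0.0, 0.0, 0.1]], "stress": [0.0] * 6},
    info,
    arrays,
)
assert info == {"energy": -1.5, "stress": [0.0] * 6}
assert arrays == {"forces": [[0.0, 0.0, 0.1]]}
```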
- class load_atoms.AtomsDataset[source]¶
An abstract base class for datasets of ase.Atoms objects. This class provides a common interface for interacting with datasets of atomic structures, abstracting over the underlying storage mechanism.
The two current concrete implementations are InMemoryAtomsDataset and LmdbAtomsDataset.
- abstract property structure_sizes: ndarray¶
An array containing the number of atoms in each structure, such that structure_sizes[i] == len(dataset[i]) for every valid index i.
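The invariant, and its relationship to n_atoms below, can be illustrated with plain lists standing in for Atoms objects (a sketch, not the real implementation):

```python
# Hypothetical structures, each represented by its chemical symbols
structures = [["C", "H", "H", "H", "H"], ["O", "H", "H"]]

structure_sizes = [len(s) for s in structures]  # atoms per structure
n_atoms = sum(structure_sizes)                  # total atoms in the dataset

assert structure_sizes == [5, 3]
assert n_atoms == 8
```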
- __getitem__(index: int) → Atoms[source]¶
- __getitem__(index: list[int] | list[bool] | np.ndarray | slice) → Self
Get the structure(s) at the given index(es).
If a single int is provided, the corresponding structure is returned:
>>> QM7 = load_dataset("QM7")
>>> QM7[0]
Atoms(symbols='CH4', pbc=False)
If a slice is provided, a new AtomsDataset is returned containing the structures in the slice:
>>> QM7[:5]
Dataset:
    structures: 5
    atoms: 32
    species:
        H: 68.75%
        C: 28.12%
        O: 3.12%
    properties:
        per atom: ()
        per structure: (energy)
If a list or numpy.ndarray of ints is provided, a new AtomsDataset is returned containing the structures at the given indices:
>>> len(QM7[[0, 2, 4]])
3
If a list or numpy.ndarray of bools is provided with the same length as the dataset, a new AtomsDataset is returned containing the structures where the boolean is True (see also filter_by()):
- Parameters:
index – The index(es) to get the structure(s) at.
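The four indexing modes above can be sketched with a plain list standing in for the dataset (hypothetical helper, not the library's implementation):

```python
from typing import Any

def getitem(structures: list, index: Any):
    """Dispatch on index type, mirroring the documented overloads."""
    if isinstance(index, slice):
        return structures[index]              # sub-dataset from a slice
    if isinstance(index, int):
        return structures[index]              # single structure
    if all(isinstance(i, bool) for i in index):
        assert len(index) == len(structures), "boolean mask must match length"
        return [s for s, keep in zip(structures, index) if keep]
    return [structures[i] for i in index]     # integer fancy indexing

data = ["CH4", "H2O", "NH3", "CO2"]
assert getitem(data, 0) == "CH4"
assert getitem(data, slice(0, 2)) == ["CH4", "H2O"]
assert getitem(data, [0, 2]) == ["CH4", "NH3"]
assert getitem(data, [True, False, False, True]) == ["CH4", "CO2"]
```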
- write(path: Path | str, format: str | None = None, append: bool = False, **kwargs: Any)[source]¶
Write the dataset to a file, using ase.io.write().
- Parameters:
path – The path to write the dataset to.
format – The format to write the dataset in.
append – Whether to append to the file.
kwargs – Additional keyword arguments to pass to ase.io.write().
- abstract property info: Mapping[str, Any]¶
Get a mapping from keys that are shared across all structures' .info attributes to the concatenated corresponding values. The returned mapping conforms to:
for key, value in dataset.info.items():
    for i, structure in enumerate(dataset):
        assert structure.info[key] == value[i]
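The contract above keeps only keys present in every structure. A minimal sketch with plain dicts standing in for per-structure .info attributes (hypothetical data):

```python
# Hypothetical per-structure .info dicts
infos = [
    {"energy": -10.0, "config_type": "bulk"},
    {"energy": -12.5, "config_type": "surface"},
    {"energy": -11.0},  # no config_type here, so the key is not shared
]

def shared_info(per_structure: list) -> dict:
    """Map each key present in *every* structure to the list of its values."""
    keys = set.intersection(*(set(d) for d in per_structure))
    return {key: [d[key] for d in per_structure] for key in keys}

info = shared_info(infos)
assert set(info) == {"energy"}          # "config_type" was dropped
assert info["energy"] == [-10.0, -12.5, -11.0]
```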
- abstract property arrays: Mapping[str, ndarray]¶
Get a mapping from each structure's .arrays keys to arrays. The returned mapping conforms to:
for key, value in dataset.arrays.items():
    assert value.shape[0] == dataset.n_atoms
    assert (value == np.vstack(
        [structure.arrays[key] for structure in dataset]
    )).all()
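The contract amounts to vertically stacking each structure's per-atom array, so the first axis of each value has length n_atoms. A stdlib sketch, with nested lists standing in for numpy arrays (hypothetical data):

```python
# Per-atom "forces" rows for two hypothetical structures (3 atoms + 2 atoms)
structure_arrays = [
    {"forces": [[0.0, 0.0, 0.1], [0.0, 0.0, -0.1], [0.1, 0.0, 0.0]]},
    {"forces": [[0.2, 0.0, 0.0], [-0.2, 0.0, 0.0]]},
]

def concatenated_arrays(per_structure: list) -> dict:
    """Stack each shared per-atom array along the first axis."""
    keys = set.intersection(*(set(d) for d in per_structure))
    return {
        key: [row for d in per_structure for row in d[key]]
        for key in keys
    }

stacked = concatenated_arrays(structure_arrays)
n_atoms = sum(len(d["forces"]) for d in structure_arrays)
assert len(stacked["forces"]) == n_atoms  # first axis == total atom count
```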
- abstract classmethod save(path: Path, structures: Iterable[Atoms], description: DatabaseEntry | None = None)[source]¶
Save the dataset to a file.
- Parameters:
path – The path to save the dataset to.
structures – The structures to save to the dataset.
description – The description of the dataset.
- abstract classmethod load(path: Path) → Self[source]¶
Load the dataset from a file.
- Parameters:
path – The path to load the dataset from.
- property n_atoms: int¶
The total number of atoms in the dataset.
This is equivalent to the sum of the number of atoms in each structure.
- filter_by(*functions: Callable[[Atoms], bool], **info_kwargs: Any) → Self[source]¶
Return a new dataset containing only the structures that match the given criteria.
- Parameters:
functions – Functions to filter the dataset by. Each function should take an ASE Atoms object as input and return a boolean.
info_kwargs – Keyword arguments to filter the dataset by. Only atoms objects with matching info keys and values will be returned.
Example
Get small, amorphous structures with large forces:
>>> from load_atoms import load_dataset
>>> dataset = load_dataset("C-GAP-17")
>>> dataset.filter_by(
...     lambda structure: len(structure) < 50,
...     lambda structure: structure.arrays["force"].max() > 5,
...     config_type="bulk_amo"
... )
Dataset:
    structures: 609
    atoms: 23,169
    species:
        C: 100.00%
    properties:
        per atom: (force)
        per structure: (config_type, detailed_ct, split, energy)
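The filtering semantics can be sketched with plain dicts standing in for Atoms objects (a simplification; the real method operates on ase.Atoms):

```python
from typing import Callable

def filter_by(structures: list, *functions: Callable, **info_kwargs):
    """Keep structures passing every predicate and matching every info value."""
    def keep(s) -> bool:
        return all(f(s) for f in functions) and all(
            s["info"].get(key) == value for key, value in info_kwargs.items()
        )
    return [s for s in structures if keep(s)]

data = [
    {"n_atoms": 10, "info": {"config_type": "bulk_amo"}},
    {"n_atoms": 80, "info": {"config_type": "bulk_amo"}},
    {"n_atoms": 12, "info": {"config_type": "dimer"}},
]
small_amo = filter_by(
    data,
    lambda s: s["n_atoms"] < 50,
    config_type="bulk_amo",
)
assert small_amo == [data[0]]  # only the small bulk_amo structure survives
```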
- random_split(splits: Sequence[float] | Sequence[int], seed: int = 42, keep_ratio: str | None = None) → list[Self][source]¶
Randomly split the dataset into multiple, disjoint parts.
- Parameters:
splits – The number of structures to put in each split. If a list of floats, the splits will be calculated as a fraction of the dataset size.
seed – The random seed to use for shuffling the dataset.
keep_ratio – If not None, splits will be generated to maintain the ratio of structures in each split with the specified .info value.
- Returns:
A list of new datasets, each containing a subset of the original dataset.
- Return type:
list[Self]
Examples
Split a dataset into 80% training and 20% test sets:
>>> train, test = dataset.random_split([0.8, 0.2])
Split a dataset into 3 parts:
>>> train, val, test = dataset.random_split([1_000, 100, 100])
Maintain the ratio of config_type values in each split:
>>> from load_atoms import load_dataset
>>> import numpy as np
>>> # helper function
>>> def ratios(thing):
...     values, counts = np.unique(thing, return_counts=True)
...     max_len = max(len(str(v)) for v in values)
...     for v, c in zip(values, counts / counts.sum()):
...         print(f"{v:>{max_len}}: {c:>6.2%}")
...
>>> dataset = load_dataset("C-GAP-17")
>>> ratios(dataset.info["config_type"])
  bulk_amo: 75.28%
bulk_cryst:  8.83%
     dimer:  0.66%
  surf_amo: 15.23%
>>> train, val, test = dataset.random_split(
...     [0.6, 0.2, 0.2],
...     keep_ratio="config_type"
... )
>>> ratios(train.info["config_type"])
  bulk_amo: 75.28%
bulk_cryst:  8.83%
     dimer:  0.66%
  surf_amo: 15.23%
- k_fold_split(k: int = 5, fold: int = 0, shuffle: bool = True, seed: int = 42, keep_ratio: str | None = None) → tuple[Self, Self][source]¶
Generate an (optionally shuffled) train/test split for cross-validation.
- Parameters:
k – The number of folds to use.
fold – The fold to use for testing.
shuffle – Whether to shuffle the dataset before splitting.
seed – The random seed to use for shuffling the dataset.
keep_ratio – If not None, splits will be generated to maintain the ratio of structures in each split with the specified .info value.
- Returns:
The train and test datasets.
- Return type:
tuple[Self, Self]
Example
Basic usage:
>>> for i in range(5):
...     train, test = dataset.k_fold_split(k=5, fold=i)
...     ...  # do something, e.g. train a model
Maintain the ratio of config_type values in each split (see also random_split() for a more detailed example of this feature):
>>> train, test = dataset.k_fold_split(
...     k=5, fold=0, keep_ratio="config_type"
... )
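The unshuffled fold arithmetic can be sketched over bare indices (hypothetical helper; the library works on structures, and shuffles by default):

```python
def k_fold_split(n: int, k: int = 5, fold: int = 0):
    """Return (train, test) index lists for the given fold (no shuffling)."""
    # distribute the remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = sum(fold_sizes[:fold])
    test = list(range(start, start + fold_sizes[fold]))
    test_set = set(test)
    train = [i for i in range(n) if i not in test_set]
    return train, test

train, test = k_fold_split(10, k=5, fold=1)
assert test == [2, 3]                       # fold 1 of five 2-item folds
assert len(train) == 8 and not set(train) & set(test)
```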
- class load_atoms.atoms_dataset.InMemoryAtomsDataset[source]¶
An in-memory implementation of AtomsDataset.
Internally, this class wraps a list of ase.Atoms objects, all of which are stored in RAM. Suitable for small to moderately large datasets.
- class load_atoms.atoms_dataset.LmdbAtomsDataset[source]¶
An LMDB-backed implementation of AtomsDataset.
Internally, this class wraps an lmdb.Environment object, which stores the dataset in an LMDB database. Accessing data from this dataset type is (marginally) slower than for an InMemoryAtomsDataset, but allows for efficient processing of extremely large datasets that cannot fit in memory.