Datasets
GraphDatasets are collections of AtomicGraphs.
We provide a base class, GraphDataset, together with several
implementations. The most common way to get a dataset of graphs is to use
load_atoms_dataset() or file_dataset().
Useful Datasets
- graph_pes.data.file_dataset(path, cutoff, n=None, shuffle=True, seed=42, pre_transform=True, property_map=None, others_to_include=None)
Load an ASE dataset from a file that is either:
- any plain-text file that can be read by ase.io.read(), e.g. an .xyz file
- a .db file containing a SQLite database of ase.Atoms objects that is readable as an ASE database. Under the hood, this uses the ASEDatabase class; see there for more details.
Examples
Load a dataset from a file, ensuring that the U0 property is mapped to energy:
    >>> file_dataset(
    ...     "training_data.xyz",
    ...     cutoff=5.0,
    ...     property_map={"U0": "energy"},
    ... )
By default, this function is called on any collection of arguments specified in a YAML configuration file, so that:
    data:
        train:
            path: training_data.xyz
            n: 1000
            cutoff: 5.0
            property_map:
                U0: energy
is equivalent to:
    data:
        train:
            +file_dataset:
                path: training_data.xyz
                n: 1000
                cutoff: 5.0
                property_map:
                    U0: energy
- Parameters:
path (str | pathlib.Path) – The path to the file.
cutoff (float) – The cutoff radius for the neighbour list.
n (int | None) – The number of structures to load. If None, all structures are loaded.
shuffle (bool) – Whether to shuffle the structures.
seed (int) – The random seed used for shuffling.
pre_transform (bool) – Whether to pre-calculate the neighbour lists for each structure.
property_map (dict[str, PropertyKey] | None) – A mapping from properties as named on the atoms objects to graph-pes property keys, e.g. {"U0": "energy"}.
others_to_include (list[str] | None) – A list of properties to include in the graph.other field that are present as per-atom or per-structure properties on the ase.Atoms objects.
- Returns:
The ASE dataset.
- Return type:
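The direction of property_map (name on the structure first, graph-pes key second) is easy to get backwards, so here is a minimal sketch using plain dictionaries. It is not the graph-pes implementation: apply_property_map is a hypothetical helper, the label values are made up, and the choice to leave unmapped labels under their original name is an assumption of this sketch.

```python
def apply_property_map(labels, property_map=None):
    """Rename label keys using {name on the structure: graph-pes property key}.

    Hypothetical helper for illustration only. Labels that are not
    mentioned in the map keep their original name (an assumption).
    """
    property_map = property_map or {}
    return {property_map.get(name, name): value for name, value in labels.items()}


# Labels as they might be named in a QM9-style file (made-up values).
labels = {"U0": -40.5, "forces": [[0.0, 0.0, 0.1]]}
renamed = apply_property_map(labels, property_map={"U0": "energy"})
print(sorted(renamed))  # ['energy', 'forces']
```

The value stored under U0 is untouched; only the key under which it is exposed changes.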
- graph_pes.data.load_atoms_dataset(id, cutoff, n_train, n_valid, n_test=None, split='random', seed=42, pre_transform=True, property_map=None, others_to_include=None)
Load a dataset of ase.Atoms objects using load-atoms, convert them to AtomicGraph instances, and split them into training, validation and (optionally) test sets.
Examples
Load a subset of the QM9 dataset, ensuring that the U0 property is mapped to energy:
    >>> load_atoms_dataset(
    ...     "QM9",
    ...     cutoff=5.0,
    ...     n_train=1_000,
    ...     n_valid=100,
    ...     n_test=100,
    ...     property_map={"U0": "energy"},
    ... )
Use this to specify a complete collection of datasets (train, valid and test) in a YAML configuration file:
    data:
        +load_atoms_dataset:
            id: QM9
            cutoff: 5.0
            n_train: 1_000
            n_valid: 100
            n_test: 100
            property_map:
                U0: energy
- Parameters:
id (str | pathlib.Path) – The dataset identifier. Can be a load-atoms id, or a path to an ase-readable data file.
cutoff (float) – The cutoff radius for the neighbour list.
n_train (int) – The number of training structures.
n_valid (int) – The number of validation structures.
n_test (int | None) – The number of test structures. If None, no test set is created.
split (Literal['random', 'sequential']) – The split method. "random" shuffles the structures before choosing a non-overlapping split, while "sequential" takes the first n_train structures for training and the next n_valid structures for validation.
seed (int) – The random seed.
pre_transform (bool) – Whether to pre-calculate the neighbour lists for each structure.
root – The root directory.
property_map (dict[str, PropertyKey] | None) – A mapping from properties as named on the atoms objects to graph-pes property keys, e.g. {"U0": "energy"}.
others_to_include (list[str] | None) – A list of properties to include in the graph.other field that are present as per-atom or per-structure properties on the ase.Atoms objects.
- Returns:
A collection of training, validation, and optional test datasets.
- Return type:
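The difference between the "random" and "sequential" split methods can be sketched with index lists alone. This is an illustration under stated assumptions rather than the library's code: split_indices is a hypothetical helper, the test set is assumed to be taken directly after the validation set, and asking for more structures than exist is assumed to be an error.

```python
import random


def split_indices(n_total, n_train, n_valid, n_test=None, split="random", seed=42):
    """Return non-overlapping index lists for train, valid and test.

    Hypothetical helper for illustration only.
    """
    if split not in ("random", "sequential"):
        raise ValueError(f"unknown split method: {split!r}")
    n_test = n_test or 0
    if n_train + n_valid + n_test > n_total:
        raise ValueError("requested more structures than are available")

    indices = list(range(n_total))
    if split == "random":
        # A seeded generator makes the shuffle reproducible.
        random.Random(seed).shuffle(indices)

    # Consecutive slices of one list can never overlap.
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test


train, valid, test = split_indices(10, n_train=5, n_valid=2, n_test=2, split="sequential")
print(train, valid, test)  # [0, 1, 2, 3, 4] [5, 6] [7, 8]
```

With split="random" the same slicing is applied to a shuffled index list, so repeating the call with the same seed gives the same three sets.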
- class graph_pes.data.ConcatDataset
Bases: GraphDataset
A dataset that concatenates multiple GraphDataset instances. Useful for e.g. training on datasets from multiple files simultaneously:
    data:
        train:
            +ConcatDataset:
                dimers:
                    +file_dataset:
                        path: dimers.xyz
                        cutoff: 5.0
                crystals:
                    +file_dataset:
                        path: crystals.xyz
                        cutoff: 5.0
        valid:
            ...
- Parameters:
datasets (GraphDataset) – The collection of GraphDataset instances to concatenate. The keys are arbitrary names for the datasets, and the values are the GraphDataset instances.
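To make "concatenation" concrete, the sketch below shows one common way to present several named sequences as a single indexable sequence. It is a toy stand-in, not the ConcatDataset source: ConcatSketch is a hypothetical class, plain lists of strings take the place of GraphDataset instances, and passing the datasets as keyword arguments is an assumption based on the "keys are arbitrary names" description above.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatSketch:
    """Toy stand-in: presents several named sequences as one long sequence."""

    def __init__(self, **datasets):
        self.datasets = list(datasets.values())
        # Running totals of the lengths, e.g. [3, 5] for lengths 3 and 2.
        self.offsets = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find which dataset the global index falls into...
        which = bisect_right(self.offsets, index)
        # ...and convert it to an index local to that dataset.
        start = self.offsets[which - 1] if which > 0 else 0
        return self.datasets[which][index - start]


combined = ConcatSketch(dimers=["d0", "d1", "d2"], crystals=["c0", "c1"])
print(len(combined), combined[2], combined[3])  # 5 d2 c0
```

The names (dimers, crystals) play no role in indexing here; they only label the parts.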
Base Classes
- class graph_pes.data.GraphDataset
A dataset of AtomicGraph instances.
- Parameters:
graphs (Sequence[AtomicGraph]) – The collection of AtomicGraph instances.
- prepare_data()
Make general preparations for loading the data for the dataset.
Called on rank-0 only: don’t set any state here. May be called multiple times.
- class graph_pes.data.ASEToGraphDataset
Bases: GraphDataset
A dataset that wraps a Sequence of ase.Atoms, and converts them to AtomicGraph instances.
- Parameters:
structures (Sequence[ase.Atoms]) – The collection of ase.Atoms objects to convert to AtomicGraph instances.
cutoff (float) – The cutoff to use when creating neighbour indexes for the graphs.
pre_transform (bool) – Whether to precompute the AtomicGraph objects, or only do so on-the-fly when the dataset is accessed. This pre-computation stores the graphs in memory, and so will be prohibitively expensive for large datasets.
property_mapping (Mapping[str, PropertyKey] | None) – A mapping from properties defined on the ase.Atoms objects to their appropriate names in graph-pes, see from_ase().
others_to_include (list[str] | None) – A list of properties to include in the graph.other field that are present as per-atom or per-structure properties on the ase.Atoms objects.
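The trade-off behind pre_transform (convert everything once and hold it in memory, or convert on every access) can be sketched as follows. This is an illustration only: LazyOrEagerDataset and fake_convert are hypothetical names, and strings stand in for ase.Atoms and AtomicGraph objects.

```python
class LazyOrEagerDataset:
    """Toy wrapper that converts items either up front or on access."""

    def __init__(self, structures, convert, pre_transform=True):
        self.structures = structures
        self.convert = convert
        # Eager: convert everything now and keep the results in memory.
        self.cache = [convert(s) for s in structures] if pre_transform else None

    def __len__(self):
        return len(self.structures)

    def __getitem__(self, index):
        if self.cache is not None:
            return self.cache[index]
        # Lazy: pay the conversion cost on every access instead.
        return self.convert(self.structures[index])


calls = []  # records every conversion so the cost is visible


def fake_convert(structure):
    calls.append(structure)
    return f"graph({structure})"


eager = LazyOrEagerDataset(["a", "b"], fake_convert, pre_transform=True)
n_after_eager_init = len(calls)
lazy = LazyOrEagerDataset(["a", "b"], fake_convert, pre_transform=False)
n_after_lazy_init = len(calls)
lazy[0]
lazy[0]
n_after_two_lazy_reads = len(calls)
print(n_after_eager_init, n_after_lazy_init, n_after_two_lazy_reads)  # 2 2 4
```

The eager dataset does all its work at construction, while the lazy one repeats the conversion each time the same item is read.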
Utilities
- class graph_pes.data.ase_db.ASEDatabase(path)
A class that wraps an ASE database file, allowing for indexing into the database to obtain ase.Atoms objects.
We assume that each row contains labels in the data attribute, as a mapping from property names to values, and that units are “standard” ASE units, e.g. eV, eV/Å, etc.
Fully compatible with SchNetPack Dataset Files.
See the ASE documentation for more details about this file format.
Warning
This dataset indexes into a database, performing many random access reads from disk. This can be very slow! If you are using a distributed compute cluster, ensure you copy your database file to somewhere with fast local storage (as opposed to network-attached storage).
Similarly, consider using several workers when loading the dataset, e.g. fitting/loader_kwargs/num_workers=8.
- Parameters:
path (str | pathlib.Path) – The path to the database.
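One way to follow the advice in the warning above is to copy the database file to local storage before opening it. The sketch below uses only the Python standard library. stage_to_local_storage is a hypothetical helper, not part of graph-pes, and it assumes that the system temporary directory lives on fast node-local disk, which may not hold on every cluster.

```python
import shutil
import tempfile
from pathlib import Path


def stage_to_local_storage(db_path, local_dir=None):
    """Copy a database file to local storage and return the new path.

    Hypothetical helper. `local_dir` defaults to the system temp directory,
    which is assumed here to sit on fast node-local disk.
    """
    db_path = Path(db_path)
    local_dir = Path(local_dir) if local_dir is not None else Path(tempfile.gettempdir())
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / db_path.name
    # Skip the copy if a file of the same size has already been staged.
    if not target.exists() or target.stat().st_size != db_path.stat().st_size:
        shutil.copy2(db_path, target)
    return target
```

The returned path can then be used wherever the original database path would have been given.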