ANI-1x
The ANI-1x dataset is a comprehensive collection of labelled molecular structures designed for training machine learned potentials. ANI-1x was generated using an active learning approach to produce a diverse and useful dataset covering the chemical space of organic molecules composed of C, H, N, and O atoms. Accurate energy and force labels are provided for each structure using the \(\omega\)B97x/6-31G(d) level of theory. Internally, files are downloaded from FigShare.
>>> from load_atoms import load_dataset
>>> load_dataset("ANI-1x")
ANI-1x:
    structures: 4,956,005
    atoms: 75,700,481
    species:
        H: 47.63%
        C: 30.30%
        N: 13.32%
        O: 8.75%
    properties:
        per atom: (forces)
        per structure: (dipole, energy, is_in_ccx)
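As a quick sanity check on the summary above, the totals imply an average molecule size, and the species percentages give approximate per-element atom counts. The sketch below is only arithmetic on the printed figures, not a load_atoms feature. Because the percentages are rounded, the per-element counts are estimates.

```python
# figures copied from the dataset summary
n_structures = 4_956_005
n_atoms = 75_700_481
species_percent = {"H": 47.63, "C": 30.30, "N": 13.32, "O": 8.75}

mean_atoms = n_atoms / n_structures
print(f"mean atoms per structure: {mean_atoms:.2f}")

# the percentages are rounded to two decimals, so these counts are approximate
for symbol, percent in species_percent.items():
    print(f"{symbol}: ~{round(n_atoms * percent / 100):,} atoms")
```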
License
This dataset is licensed under the CC0 license.
Citation
If you use this dataset in your work, please cite the following:
@article{Smith-18-05,
    title = {
        Less Is More: Sampling Chemical Space with Active Learning
    },
    author = {
        Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
        Isayev, Olexandr and Roitberg, Adrian E.
    },
    year = {2018},
    journal = {The Journal of Chemical Physics},
    volume = {148},
    number = {24},
    doi = {10.1063/1.5023802},
}
Properties
Per-atom:
Property | Units | Type | Description |
---|---|---|---|
forces | eV/Å | | force vectors (as labelled with \(\omega\)B97x/6-31G(d)) |
Per-structure:
Property | Units | Type | Description |
---|---|---|---|
energy | eV | | energy of the structure (as labelled with \(\omega\)B97x/6-31G(d)) |
dipole | e Å | | dipole moment of the structure (as labelled with \(\omega\)B97x/6-31G(d)) |
is_in_ccx | | | whether the structure is in the ANI-1ccx subset |
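The `is_in_ccx` flag makes it possible to pick out the ANI-1ccx subset from the loaded structures. The sketch below is a minimal illustration, assuming each structure exposes an `info` dictionary holding its per-structure properties (as `ase.Atoms` objects do). The helper name `select_ccx` and the stand-in structures are hypothetical and are not part of load_atoms.

```python
from types import SimpleNamespace


def select_ccx(structures):
    """Return the structures flagged as belonging to the ANI-1ccx subset."""
    return [s for s in structures if s.info["is_in_ccx"]]


# Hypothetical stand-ins for loaded structures: only the `info` dictionary
# is modelled here, and the energies are made-up values in eV.
structures = [
    SimpleNamespace(info={"energy": -1100.0, "is_in_ccx": True}),
    SimpleNamespace(info={"energy": -2075.5, "is_in_ccx": False}),
    SimpleNamespace(info={"energy": -1102.3, "is_in_ccx": True}),
]

ccx = select_ccx(structures)
print(f"{len(ccx)} of {len(structures)} structures are in the ANI-1ccx subset")
```

The same list comprehension should work on any iterable of structures that carry the flag in `info`.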
Miscellaneous information
ANI-1x is imported as an LmdbAtomsDataset:
Importer script for ANI-1x
from __future__ import annotations

from pathlib import Path
from typing import Iterator

import h5py
import numpy as np
from ase import Atoms
from load_atoms.database.backend import (
    BaseImporter,
    FileDownload,
)
from load_atoms.progress import Progress

Ha_to_eV = 27.2114079527


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://springernature.figshare.com/ndownloader/files/18112775",
                expected_hash="fe0ba06198ee",
                local_name="ani1x-release.h5",
            )
        ]

    @classmethod
    def permanent_download_dirname(cls) -> str | None:
        # ensure this file download is persisted so that we don't
        # have to re-download the same file for the ANI-1ccx dataset
        return "ANI"

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        with h5py.File(tmp_dir / "ani1x-release.h5", "r") as f:
            n_structures = sum(
                data["coordinates"].shape[0] for data in f.values()
            )
            task = progress.new_task(
                "Processing 5M structures",
                total=n_structures,
            )

            # iterate over each chemical formula in the dataset:
            for data in f.values():
                Zs = data["atomic_numbers"]
                coords = data["coordinates"][()]
                dft_energy = data["wb97x_dz.energy"][()]
                dft_dipole = data["wb97x_dz.dipole"][()]
                dft_forces = data["wb97x_dz.forces"][()]
                cc_energy = data["ccsd(t)_cbs.energy"][()]

                for i in range(data["coordinates"].shape[0]):
                    structure = Atoms(positions=coords[i], numbers=Zs)
                    # see: https://www.nature.com/articles/s41597-020-0473-z/tables/2
                    # energy is in hartree, convert to eV
                    structure.info["energy"] = dft_energy[i] * Ha_to_eV
                    # units of e * angstrom
                    structure.info["dipole"] = dft_dipole[i]
                    structure.info["is_in_ccx"] = not np.isnan(cc_energy[i])
                    # forces are in hartree/angstrom, convert to eV/angstrom
                    structure.arrays["forces"] = dft_forces[i] * Ha_to_eV
                    task.update(advance=1)
                    yield structure
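The label handling in the importer can be tried out in isolation. The following is a minimal sketch that uses only the standard library instead of the importer's NumPy arrays. The input numbers are made up for illustration, and `convert_labels` is a hypothetical helper, not part of load_atoms. It applies the same `Ha_to_eV` factor to an energy in hartree and to forces in hartree/Å, and derives `is_in_ccx` from whether the coupled-cluster energy is NaN.

```python
import math

Ha_to_eV = 27.2114079527  # same conversion factor as in the importer script


def convert_labels(energy_ha, forces_ha_per_ang, cc_energy_ha):
    """Convert raw hartree-based labels to the units stored in the dataset."""
    return {
        "energy": energy_ha * Ha_to_eV,  # eV
        "forces": [
            [component * Ha_to_eV for component in force]
            for force in forces_ha_per_ang
        ],  # eV/Å
        # a NaN coupled-cluster energy marks a structure outside ANI-1ccx
        "is_in_ccx": not math.isnan(cc_energy_ha),
    }


# made-up values for a single two-atom structure
labels = convert_labels(
    energy_ha=-1.0,
    forces_ha_per_ang=[[0.0, 0.0, 0.01], [0.0, 0.0, -0.01]],
    cc_energy_ha=float("nan"),
)
print(labels["energy"], labels["is_in_ccx"])
```

The dipole is left out of the sketch because the importer stores it unchanged, in units of e Å.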
DatabaseEntry for ANI-1x
name: ANI-1x
year: 2018
category: Benchmarks
license: CC0
minimum_load_atoms_version: 0.3
format: lmdb
description: |
    The ANI-1x dataset is a comprehensive collection of labelled molecular structures designed for training machine learned potentials.
    ANI-1x was generated using an active learning approach to produce a diverse and useful dataset
    covering the chemical space of organic molecules composed of C, H, N, and O atoms.
    Accurate energy and force labels are provided for each structure using the :math:`\omega`\ B97x/6-31G(d) level of theory.
    Internally, files are downloaded from
    `FigShare <https://springernature.figshare.com/collections/The_ANI-1ccx_and_ANI-1x_data_sets_coupled-cluster_and_density_functional_theory_properties_for_molecules/4712477>`__.
citation: |
    @article{Smith-18-05,
        title = {
            Less Is More: Sampling Chemical Space with Active Learning
        },
        author = {
            Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
            Isayev, Olexandr and Roitberg, Adrian E.
        },
        year = {2018},
        journal = {The Journal of Chemical Physics},
        volume = {148},
        number = {24},
        doi = {10.1063/1.5023802},
    }
per_atom_properties:
    forces:
        desc: force vectors (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV/Å
per_structure_properties:
    energy:
        desc: energy of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV
    dipole:
        desc: dipole moment of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: e Å
    is_in_ccx:
        desc: whether the structure is in the :doc:`/datasets/ANI-1ccx` subset
representative_structure: 205_000