ANI-1x

The ANI-1x dataset is a comprehensive collection of labelled molecular structures designed for training machine-learned potentials. ANI-1x was generated using an active learning approach to produce a diverse and useful dataset covering the chemical space of organic molecules composed of C, H, N, and O atoms. Accurate energy and force labels are provided for each structure at the \(\omega\)B97x/6-31G(d) level of theory. Internally, files are downloaded from FigShare.

>>> from load_atoms import load_dataset
>>> load_dataset("ANI-1x")
ANI-1x:
    structures: 4,956,005
    atoms: 75,700,481
    species:
        H: 47.63%
        C: 30.30%
        N: 13.32%
        O: 8.75%
    properties:
        per atom: (forces)
        per structure: (dipole, energy, is_in_ccx)

License

This dataset is licensed under the CC0 license.

Citation

If you use this dataset in your work, please cite the following:

@article{Smith-18-05,
    title = {
        Less Is More: Sampling Chemical Space with Active Learning
    },
    author = {
        Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
        Isayev, Olexandr and Roitberg, Adrian E.
    },
    year = {2018},
    journal = {The Journal of Chemical Physics},
    volume = {148},
    number = {24},
    doi = {10.1063/1.5023802},
}

Properties

Per-atom:

Property   Units   Type            Description
forces     eV/Å    ndarray(N, 3)   force vectors (as labelled with \(\omega\)B97x/6-31G(d))

Per-structure:

Property    Units   Type          Description
energy      eV      float64       energy of the structure (as labelled with \(\omega\)B97x/6-31G(d))
dipole      e Å     ndarray(3,)   dipole moment of the structure (as labelled with \(\omega\)B97x/6-31G(d))
is_in_ccx   -       bool          whether the structure is in the ANI-1ccx subset
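
The eV-based units listed above come from a conversion applied at import time: the raw file stores energies in hartree and forces in hartree/Å, both are scaled by the same hartree-to-eV factor, and `is_in_ccx` marks structures whose coupled-cluster energy is not NaN. A minimal NumPy sketch of that logic follows, using made-up values for a hypothetical group of two three-atom structures (the array names and numbers are illustrative and not taken from the dataset):

```python
import numpy as np

HA_TO_EV = 27.2114079527  # hartree -> eV, the factor used by the importer

# Made-up raw labels for two conformations of a three-atom molecule.
raw_energy = np.array([-76.40, -76.39])   # hartree
raw_forces = np.zeros((2, 3, 3))          # hartree/angstrom
raw_forces[0, 0] = [0.01, 0.0, -0.02]
cc_energy = np.array([-76.35, np.nan])    # NaN means no CCSD(T) label

energy_ev = raw_energy * HA_TO_EV         # eV
forces_ev = raw_forces * HA_TO_EV         # eV/angstrom (length unit unchanged)
is_in_ccx = ~np.isnan(cc_energy)          # True where a CCSD(T) label exists

print(energy_ev.shape, forces_ev.shape, is_in_ccx.tolist())
```

Because forces are energy per unit length and the length unit is already ångström, the same scalar factor converts both energies and forces.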

Miscellaneous information

ANI-1x is imported as an LmdbAtomsDataset:

Importer script for ANI-1x
from __future__ import annotations

from pathlib import Path
from typing import Iterator

import h5py
import numpy as np
from ase import Atoms
from load_atoms.database.backend import (
    BaseImporter,
    FileDownload,
)
from load_atoms.progress import Progress

Ha_to_eV = 27.2114079527


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://springernature.figshare.com/ndownloader/files/18112775",
                expected_hash="fe0ba06198ee",
                local_name="ani1x-release.h5",
            )
        ]

    @classmethod
    def permanent_download_dirname(cls) -> str | None:
        # ensure this file download is persisted so that we don't
        # have to re-download the same file for the ANI-1ccx dataset
        return "ANI"

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        with h5py.File(tmp_dir / "ani1x-release.h5", "r") as f:
            n_structures = sum(
                data["coordinates"].shape[0] for data in f.values()
            )
            task = progress.new_task(
                "Processing 5M structures",
                total=n_structures,
            )

            # iterate over each chemical formula in the dataset:
            for data in f.values():
                Zs = data["atomic_numbers"]
                coords = data["coordinates"][()]
                dft_energy = data["wb97x_dz.energy"][()]
                dft_dipole = data["wb97x_dz.dipole"][()]
                dft_forces = data["wb97x_dz.forces"][()]
                cc_energy = data["ccsd(t)_cbs.energy"][()]

                for i in range(data["coordinates"].shape[0]):
                    structure = Atoms(positions=coords[i], numbers=Zs)
                    # see: https://www.nature.com/articles/s41597-020-0473-z/tables/2
                    # energy is in hartree, convert to eV
                    structure.info["energy"] = dft_energy[i] * Ha_to_eV
                    # units of e * angstrom
                    structure.info["dipole"] = dft_dipole[i]
                    structure.info["is_in_ccx"] = not np.isnan(cc_energy[i])
                    # forces are in hartree/angstrom, convert to eV/angstrom
                    structure.arrays["forces"] = dft_forces[i] * Ha_to_eV
                    task.update(advance=1)
                    yield structure
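
The `expected_hash` passed to `FileDownload` in the importer is a short hex string, which suggests that the downloaded file is checked against a truncated digest. How load-atoms computes that value is not shown on this page. The sketch below assumes a SHA-256 digest truncated to the length of the expected string, and `short_sha256` and `matches_expected` are hypothetical helper names, not part of the library:

```python
import hashlib
import tempfile
from pathlib import Path


def short_sha256(path: Path, n_chars: int = 12, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:n_chars]


def matches_expected(path: Path, expected_hash: str) -> bool:
    """Compare the truncated digest of the file with the expected prefix."""
    return short_sha256(path, n_chars=len(expected_hash)) == expected_hash


# Demonstration on a small temporary file (a stand-in for the real download).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"not the real ANI-1x release")
    demo_hash = short_sha256(demo)
    ok = matches_expected(demo, demo_hash)    # same file, same digest
    bad = matches_expected(demo, "0" * 12)    # arbitrary non-matching prefix
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters for a large HDF5 release file.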

DatabaseEntry for ANI-1x
name: ANI-1x
year: 2018
category: Benchmarks
license: CC0
minimum_load_atoms_version: 0.3
format: lmdb
description: |
    The ANI-1x dataset is a comprehensive collection of labelled molecular structures designed for training machine learned potentials.
    ANI-1x was generated using an active learning approach to produce a diverse and useful dataset
    covering the chemical space of organic molecules composed of C, H, N, and O atoms.
    Accurate energy and force labels are provided for each structure using the :math:`\omega`\ B97x/6-31G(d) level of theory.
    Internally, files are downloaded from
    `FigShare <https://springernature.figshare.com/collections/The_ANI-1ccx_and_ANI-1x_data_sets_coupled-cluster_and_density_functional_theory_properties_for_molecules/4712477>`__.
citation: |
    @article{Smith-18-05,
        title = {
            Less Is More: Sampling Chemical Space with Active Learning
        },
        author = {
            Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
            Isayev, Olexandr and Roitberg, Adrian E.
        },
        year = {2018},
        journal = {The Journal of Chemical Physics},
        volume = {148},
        number = {24},
        doi = {10.1063/1.5023802},
    }

per_atom_properties:
    forces:
        desc: force vectors (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV/Å
per_structure_properties:
    energy:
        desc: energy of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV
    dipole:
        desc: dipole moment of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: e Å
    is_in_ccx:
        desc: whether the structure is in the :doc:`/datasets/ANI-1ccx` subset
representative_structure: 205_000