ANI-1ccx

The ANI-1ccx dataset comprises an “optimally spanning” subset of the ANI-1x dataset, with each structure re-labelled with the total structure energy using the “gold standard” CCSD(T)/CBS level of theory. Internally, files are downloaded from FigShare.

>>> from load_atoms import load_dataset
>>> load_dataset("ANI-1ccx")
ANI-1ccx:
    structures: 489,571
    atoms: 6,763,288
    species:
        H: 45.52%
        C: 29.58%
        N: 15.30%
        O: 9.60%
    properties:
        per atom: (dft_forces)
        per structure: (1x_idx, cc_energy, dft_dipole, dft_energy)

License

This dataset is licensed under the CC0 license.

Citation

If you use this dataset in your work, please cite the following:

@article{Smith-19-07,
    title = {
        Approaching Coupled Cluster Accuracy with a
        General-Purpose Neural Network Potential
        through Transfer Learning
    },
    author = {
        Smith, Justin S. and Nebgen, Benjamin T. and Zubatyuk, Roman
        and Lubbers, Nicholas and Devereux, Christian and Barros, Kipton
        and Tretiak, Sergei and Isayev, Olexandr and Roitberg, Adrian E.
    },
    year = {2019},
    journal = {Nature Communications},
    volume = {10},
    number = {1},
    pages = {2903},
    doi = {10.1038/s41467-019-10827-4},
}

@article{Smith-20-05,
    title = {
        The ANI-1ccx and ANI-1x Data Sets, Coupled-Cluster
        and Density Functional Theory Properties for Molecules
    },
    author = {
        Smith, Justin S. and Zubatyuk, Roman and Nebgen, Benjamin and
        Lubbers, Nicholas and Barros, Kipton and Roitberg, Adrian E. and
        Isayev, Olexandr and Tretiak, Sergei
    },
    year = {2020},
    journal = {Scientific Data},
    volume = {7},
    number = {1},
    pages = {134},
    doi = {10.1038/s41597-020-0473-z},
}


@article{Smith-18-05,
    title = {
        Less Is More: Sampling Chemical Space with Active Learning
    },
    author = {
        Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
        Isayev, Olexandr and Roitberg, Adrian E.
    },
    year = {2018},
    journal = {The Journal of Chemical Physics},
    volume = {148},
    number = {24},
    doi = {10.1063/1.5023802},
}

Properties

Per-atom:

Property    Units  Type           Description
----------  -----  -------------  -----------
dft_forces  eV/Å   ndarray(N, 3)  force vectors (as labelled with \(\omega\)B97x/6-31G(d))

Per-structure:

Property    Units  Type         Description
----------  -----  -----------  -----------
cc_energy   eV     float64      energy of the structure (as labelled with CCSD(T)/CBS)
dft_energy  eV     float64      energy of the structure (as labelled with \(\omega\)B97x/6-31G(d))
dft_dipole  e Å    ndarray(3,)  dipole moment of the structure (as labelled with \(\omega\)B97x/6-31G(d))
1x_idx             int          index of the structure in the ANI-1x dataset
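
Both energies are stored in eV, having been converted from Hartree with the factor used in the importer script on this page (27.2114079527 eV per Hartree). As a minimal sketch, assuming a plain dict stands in for a structure's `info` mapping and using made-up energy values, the difference between the CCSD(T)/CBS and DFT labels can be computed and converted back to Hartree as follows. The helper names `cc_correction_ev` and `ev_to_hartree` are hypothetical and not part of load_atoms.

```python
# Conversion factor copied from the importer script on this page.
HA_TO_EV = 27.2114079527


def cc_correction_ev(info):
    """Return cc_energy - dft_energy in eV for one structure's info mapping."""
    return info["cc_energy"] - info["dft_energy"]


def ev_to_hartree(energy_ev):
    """Convert an energy from eV back to Hartree."""
    return energy_ev / HA_TO_EV


# Hypothetical values for illustration only; not taken from the dataset.
example_info = {"cc_energy": -1100.0, "dft_energy": -1101.5, "1x_idx": 0}

delta_ev = cc_correction_ev(example_info)
delta_ha = ev_to_hartree(delta_ev)
print(f"correction: {delta_ev:.3f} eV = {delta_ha:.6f} Ha")
```

With a real structure, the same keys would be read from its `info` mapping instead of the example dict.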

Miscellaneous information

ANI-1ccx is imported as an LmdbAtomsDataset:

Importer script for ANI-1ccx
from __future__ import annotations

from pathlib import Path
from typing import Iterator

import h5py
import numpy as np
from ase import Atoms
from load_atoms.database.backend import BaseImporter, FileDownload
from load_atoms.progress import Progress

Ha_to_eV = 27.2114079527


class Importer(BaseImporter):
    @classmethod
    def permanent_download_dirname(cls) -> str | None:
        return "ANI"  # ensure same as ANI-1x to avoid re-downloading

    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://springernature.figshare.com/ndownloader/files/18112775",
                expected_hash="fe0ba06198ee",
                local_name="ani1x-release.h5",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        with h5py.File(tmp_dir / "ani1x-release.h5", "r") as f:
            n_structures = sum(
                (~np.isnan(data["ccsd(t)_cbs.energy"][()])).sum()
                for data in f.values()
            )
            task = progress.new_task(
                "Processing 500k structures",
                total=n_structures,
            )
            ani1x_idx = -1

            # iterate over each chemical formula in the dataset:
            for data in f.values():
                Zs = data["atomic_numbers"]
                coords = data["coordinates"][()]
                cc_energy = data["ccsd(t)_cbs.energy"][()]
                dft_energy = data["wb97x_dz.energy"][()]
                dft_forces = data["wb97x_dz.forces"][()]
                dft_dipole = data["wb97x_dz.dipole"][()]

                for i in range(len(cc_energy)):
                    ani1x_idx += 1

                    if np.isnan(cc_energy[i]):
                        continue

                    structure = Atoms(
                        positions=coords[i],
                        numbers=Zs,
                    )
                    # see: https://www.nature.com/articles/s41597-020-0473-z/tables/2
                    # energy is in hartree, convert to eV
                    structure.info["dft_energy"] = dft_energy[i] * Ha_to_eV
                    structure.info["cc_energy"] = cc_energy[i] * Ha_to_eV
                    # units of e * angstrom
                    structure.info["dft_dipole"] = dft_dipole[i]
                    structure.info["1x_idx"] = ani1x_idx
                    # forces are in hartree/angstrom, convert to eV/angstrom
                    structure.arrays["dft_forces"] = dft_forces[i] * Ha_to_eV

                    task.update(advance=1)
                    yield structure
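
The running `ani1x_idx` counter in the importer above is incremented for every conformer, including those without a CCSD(T)/CBS label, so the stored `1x_idx` counts positions over all conformers in the file rather than within this subset. A minimal, dependency-free sketch of that bookkeeping follows, using hypothetical lists of energies in place of the HDF5 groups. The function name `labelled_indices` and the toy values are illustrative assumptions.

```python
import math


def labelled_indices(groups):
    """Yield (ani1x_idx, energy) for each conformer with a non-NaN energy.

    `groups` is a hypothetical stand-in for the per-formula HDF5 groups:
    a list of lists of CCSD(T)/CBS energies, with NaN marking "no label".
    """
    ani1x_idx = -1
    for energies in groups:
        for energy in energies:
            ani1x_idx += 1  # counts every conformer, labelled or not
            if math.isnan(energy):
                continue
            yield ani1x_idx, energy


nan = float("nan")
toy_groups = [[-1.0, nan, -2.0], [nan, -3.0]]  # made-up values
result = list(labelled_indices(toy_groups))
print(result)  # [(0, -1.0), (2, -2.0), (4, -3.0)]
```

The skipped conformers still consume an index, which is why the indices in the output are not contiguous.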
DatabaseEntry for ANI-1ccx
name: ANI-1ccx
year: 2019
category: Benchmarks
license: CC0
minimum_load_atoms_version: 0.3
format: lmdb
description: |
    The ANI-1ccx dataset comprises an "optimally spanning" subset of the :doc:`/datasets/ANI-1x` dataset,
    with each structure re-labelled with the total structure energy using the
    "gold standard" CCSD(T)/CBS level of theory. Internally, files are downloaded from
    `FigShare <https://springernature.figshare.com/collections/The_ANI-1ccx_and_ANI-1x_data_sets_coupled-cluster_and_density_functional_theory_properties_for_molecules/4712477>`__.
citation: |
    @article{Smith-19-07,
        title = {
            Approaching Coupled Cluster Accuracy with a
            General-Purpose Neural Network Potential
            through Transfer Learning
        },
        author = {
            Smith, Justin S. and Nebgen, Benjamin T. and Zubatyuk, Roman
            and Lubbers, Nicholas and Devereux, Christian and Barros, Kipton
            and Tretiak, Sergei and Isayev, Olexandr and Roitberg, Adrian E.
        },
        year = {2019},
        journal = {Nature Communications},
        volume = {10},
        number = {1},
        pages = {2903},
        doi = {10.1038/s41467-019-10827-4},
    }

    @article{Smith-20-05,
        title = {
            The ANI-1ccx and ANI-1x Data Sets, Coupled-Cluster
            and Density Functional Theory Properties for Molecules
        },
        author = {
            Smith, Justin S. and Zubatyuk, Roman and Nebgen, Benjamin and
            Lubbers, Nicholas and Barros, Kipton and Roitberg, Adrian E. and
            Isayev, Olexandr and Tretiak, Sergei
        },
        year = {2020},
        journal = {Scientific Data},
        volume = {7},
        number = {1},
        pages = {134},
        doi = {10.1038/s41597-020-0473-z},
    }


    @article{Smith-18-05,
        title = {
            Less Is More: Sampling Chemical Space with Active Learning
        },
        author = {
            Smith, Justin S. and Nebgen, Ben and Lubbers, Nicholas and
            Isayev, Olexandr and Roitberg, Adrian E.
        },
        year = {2018},
        journal = {The Journal of Chemical Physics},
        volume = {148},
        number = {24},
        doi = {10.1063/1.5023802},
    }


per_atom_properties:
    dft_forces:
        desc: force vectors (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV/Å
per_structure_properties:
    cc_energy:
        desc: energy of the structure (as labelled with CCSD(T)/CBS)
        units: eV
    dft_energy:
        desc: energy of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: eV
    dft_dipole:
        desc: dipole moment of the structure (as labelled with :math:`\omega`\ B97x/6-31G(d))
        units: e Å
    1x_idx:
        desc: index of the structure in the :doc:`/datasets/ANI-1x` dataset

representative_structure: 413_000