rMD17

A dataset composed of a single MD trajectory for each of 10 molecules. Original structures are taken from Chmiela et al., with energy and force labels recalculated by Christensen and von Lilienfeld using “the PBE/def2-SVP level of theory [with] very tight SCF convergence and [a] very dense DFT integration grid”. The MD trajectories are presented one at a time, with structures within each trajectory in chronological order.

>>> from load_atoms import load_dataset
>>> load_dataset("rMD17")
rMD17:
    structures: 999,988
    atoms: 15,599,712
    species:
        H: 44.23%
        C: 43.59%
        O: 8.97%
        N: 3.21%
    properties:
        per atom: (forces)
        per structure: (energy, name)
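
The totals in this summary can be cross-checked with a little arithmetic. The sketch below is not part of load-atoms. It combines the standard molecular formulas of the ten molecules with an assumed number of frames per trajectory: 100,000 each, except 99,988 for azobenzene. That assumption is not stated on this page, but it is the split that reproduces the reported totals.

```python
from collections import Counter

# Atoms per molecule, from the standard molecular formulas.
FORMULAS = {
    "aspirin": {"C": 9, "H": 8, "O": 4},
    "benzene": {"C": 6, "H": 6},
    "ethanol": {"C": 2, "H": 6, "O": 1},
    "malonaldehyde": {"C": 3, "H": 4, "O": 2},
    "naphthalene": {"C": 10, "H": 8},
    "paracetamol": {"C": 8, "H": 9, "N": 1, "O": 2},
    "salicylic": {"C": 7, "H": 6, "O": 3},
    "toluene": {"C": 7, "H": 8},
    "uracil": {"C": 4, "H": 4, "N": 2, "O": 2},
    "azobenzene": {"C": 12, "H": 10, "N": 2},
}

# ASSUMPTION: frames per trajectory (not stated on this page).
FRAMES = {name: 100_000 for name in FORMULAS}
FRAMES["azobenzene"] = 99_988

n_structures = sum(FRAMES.values())

# Accumulate the number of atoms of each element over all frames.
species_totals = Counter()
for name, formula in FORMULAS.items():
    for element, count in formula.items():
        species_totals[element] += count * FRAMES[name]

n_atoms = sum(species_totals.values())
percentages = {
    element: round(100 * total / n_atoms, 2)
    for element, total in species_totals.items()
}

print(n_structures)  # 999988, matching "structures" in the summary
print(n_atoms)       # 15599712, matching "atoms" in the summary
print(percentages)
```

Under these assumptions the species percentages also come out as 44.23% H, 43.59% C, 8.97% O and 3.21% N.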

Citation

If you use this dataset in your work, please cite the following:

@article{Christensen-20-10,
    title = {
        On the Role of Gradients for Machine
        Learning of Molecular Energies and Forces
    },
    author = {Christensen, Anders S. and von Lilienfeld, O. Anatole},
    year = {2020},
    journal = {Machine Learning: Science and Technology},
    volume = {1},
    number = {4},
    pages = {045018},
    doi = {10.1088/2632-2153/abba6f},
}

@article{Chmiela-17-05,
    title = {
        Machine Learning of Accurate Energy-Conserving Molecular Force Fields
    },
    author = {
        Chmiela, Stefan and Tkatchenko, Alexandre
        and Sauceda, Huziel E. and Poltavsky, Igor
        and Sch{\"u}tt, Kristof T. and M{\"u}ller, Klaus-Robert
    },
    year = {2017},
    journal = {Science Advances},
    volume = {3},
    number = {5},
    pages = {e1603015},
    doi = {10.1126/sciadv.1603015},
}

Properties

Per-atom:

Property    Units    Type             Description
forces      eV/Å     ndarray(N, 3)    forces

Per-structure:

Property    Units    Type       Description
energy      eV       float64    energy
name        str      str        name of the molecule
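
The eV and eV/Å units listed here are the result of a conversion. The importer script on this page uses ASE's `kcal` and `mol` constants, which indicates that the raw archives hold values in kcal/mol. As a minimal, dependency-free sketch of that conversion (the constant and function names below are illustrative, not part of load-atoms or ASE):

```python
# Exact SI defining constants (2019 redefinition).
AVOGADRO = 6.02214076e23             # particles per mole
ELEMENTARY_CHARGE = 1.602176634e-19  # joules per electronvolt
JOULE_PER_KCAL = 4184.0              # thermochemical kilocalorie

# 1 kcal/mol expressed in eV (per particle).
KCAL_PER_MOL_IN_EV = JOULE_PER_KCAL / (AVOGADRO * ELEMENTARY_CHARGE)


def kcal_per_mol_to_ev(value: float) -> float:
    """Convert an energy in kcal/mol to eV."""
    return value * KCAL_PER_MOL_IN_EV


print(f"{KCAL_PER_MOL_IN_EV:.6f}")  # roughly 0.043364
```

Forces convert with the same factor, because the length unit (Å) is unchanged. In ASE, `ase.units.kcal / ase.units.mol` evaluates to approximately the same number, and multiplying a value by a unit converts it into ASE's base units (eV and Å).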

Miscellaneous information

rMD17 is imported as an InMemoryAtomsDataset:

Importer script for rMD17

from __future__ import annotations

from pathlib import Path
from typing import Iterator

import numpy as np
from ase import Atoms
from ase.units import eV, kcal, mol
from load_atoms.database.backend import BaseImporter, unzip_file
from load_atoms.database.internet import FileDownload
from load_atoms.progress import Progress


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://figshare.com/ndownloader/files/23950376",
                expected_hash="cddeea2ec2c4",
                local_name="rmd17.tar.bz2",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        # Unzip the file
        contents_path = (
            unzip_file(tmp_dir / "rmd17.tar.bz2", progress) / "rmd17/npz_data"
        )

        # Process each npz archive
        structure_names = "aspirin benzene ethanol malonaldehyde naphthalene paracetamol salicylic toluene uracil azobenzene".split()  # noqa: E501
        assert len(structure_names) == 10

        for structure_name in structure_names:
            archive_path = contents_path / f"rmd17_{structure_name}.npz"
            archive = np.load(archive_path)
            Z = archive["nuclear_charges"]
            coords = archive["coords"]
            energy = archive["energies"]
            forces = archive["forces"]

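            # Visit frames in ascending order of their original index
            for idx in np.argsort(archive["old_indices"]):
                structure = Atoms(numbers=Z, positions=coords[idx])
                structure.info["name"] = structure_name
                # The archives store kcal/mol and kcal/mol/Å. In ASE, multiplying
                # by a unit converts into ASE's base units; dividing by eV then
                # expresses the result in eV (and eV/Å for the forces).
                structure.info["energy"] = energy[idx] * (kcal / mol) / eV
                structure.arrays["forces"] = forces[idx] * (kcal / mol) / eV
                yield structure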
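
The loop over `np.argsort(archive["old_indices"])` is what puts each trajectory into chronological order, assuming that `old_indices` records each frame's position in the original trajectory. A small pure-Python sketch of the same idea follows; the index and energy values are made up for illustration.

```python
def argsort(values):
    """Return the indices that would sort `values` in ascending order."""
    return sorted(range(len(values)), key=values.__getitem__)


# Hypothetical data: frames stored in shuffled order, together with the
# position each frame had in the original trajectory.
old_indices = [42, 7, 19, 3]
energies = [-1.0, -2.0, -3.0, -4.0]

order = argsort(old_indices)
chronological = [energies[i] for i in order]

print(order)          # [3, 1, 2, 0]
print(chronological)  # [-4.0, -2.0, -3.0, -1.0]
```

For a one-dimensional array, `np.argsort` returns the same kind of index list, so indexing `coords`, `energies` and `forces` with each index in turn yields the frames sorted by their original position.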

DatabaseEntry for rMD17

name: rMD17
year: 2020
description: |
    A dataset composed of a single MD trajectory for each of 10 molecules.
    Original structures are taken from Chmiela et al., with energy and force
    labels recalculated by Christensen and von Lilienfeld using "the PBE/def2-SVP
    level of theory [with] very tight SCF convergence and [a] very dense DFT
    integration grid". The MD trajectories are presented one at a time, with
    structures within each trajectory in chronological order.
category: Benchmarks
minimum_load_atoms_version: 0.2
per_structure_properties:
    energy:
        desc: energy
        units: eV
    name:
        desc: name of the molecule
        units: str
per_atom_properties:
    forces:
        desc: forces
        units: eV/Å
representative_structure: 0
citation: |
    @article{Christensen-20-10,
        title = {
            On the Role of Gradients for Machine
            Learning of Molecular Energies and Forces
        },
        author = {Christensen, Anders S. and von Lilienfeld, O. Anatole},
        year = {2020},
        journal = {Machine Learning: Science and Technology},
        volume = {1},
        number = {4},
        pages = {045018},
        doi = {10.1088/2632-2153/abba6f},
    }

    @article{Chmiela-17-05,
        title = {
            Machine Learning of Accurate Energy-Conserving Molecular Force Fields
        },
        author = {
            Chmiela, Stefan and Tkatchenko, Alexandre
            and Sauceda, Huziel E. and Poltavsky, Igor
            and Sch{\"u}tt, Kristof T. and M{\"u}ller, Klaus-Robert
        },
        year = {2017},
        journal = {Science Advances},
        volume = {3},
        number = {5},
        pages = {e1603015},
        doi = {10.1126/sciadv.1603015},
    }