C-SYNTH-23M

The complete “synthetic” dataset of carbon structures from “Synthetic Data Enable Experiments in Atomistic Machine Learning”. This dataset comprises 546 uncorrelated MD trajectories of 200-atom structures, driven by the C-GAP-17 interatomic potential and sampled every 1 ps. The structures cover a wide range of densities, temperatures and degrees of dis/order.

>>> from load_atoms import load_dataset
>>> load_dataset("C-SYNTH-23M")
C-SYNTH-23M:
    structures: 115,206
    atoms: 23,041,200
    species:
        C: 100.00%
    properties:
        per atom: (forces, local_energies)
        per structure: (anneal_T, density, energy, run_id, time)
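
The figures in this summary can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the idea that every trajectory contributes the same number of snapshots is an assumption, not something stated in the dataset description.

```python
# Figures quoted in the dataset summary above.
n_structures = 115_206
n_atoms = 23_041_200
n_trajectories = 546
atoms_per_structure = 200

# Every structure holds 200 atoms, so the atom count should factor exactly.
assert n_structures * atoms_per_structure == n_atoms

# Assumption: every trajectory contributes the same number of snapshots.
snapshots_per_trajectory, remainder = divmod(n_structures, n_trajectories)
print(snapshots_per_trajectory, remainder)  # 211 0
```

The exact division suggests 211 snapshots per trajectory, which at 1 ps sampling would be consistent with runs spanning roughly 210 ps, although the start time of each run is not stated here.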

License

This dataset is licensed under the MIT license.

Citation

If you use this dataset in your work, please cite the following:

@article{Gardner-23-03,
  title = {
    Synthetic Data Enable Experiments in Atomistic Machine Learning
  },
  author = {
    Gardner, John L. A. and Beaulieu, Zo{\'e} Faure
    and Deringer, Volker L.
  },
  year = {2023},
  journal = {Digital Discovery},
  doi = {10.1039/D2DD00137C},
}

Properties

Per-atom:

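==============  =====  =============  =========================
Property        Units  Type           Description
==============  =====  =============  =========================
forces          eV/Å   ndarray(N, 3)  force vectors (C-GAP-17)
local_energies  eV     ndarray(N,)    local energies (C-GAP-17)
==============  =====  =============  =========================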

Per-structure:

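========  =======  =======  ===========================================
Property  Units    Type     Description
========  =======  =======  ===========================================
energy    eV       float64  total energy of the structure (C-GAP-17)
anneal_T  K        int64    annealing temperature
density   g cm⁻³   float64  density of the structure
run_id             int64    unique identifier for the trajectory
time      ps       int64    timestep of the structure in the trajectory
========  =======  =======  ===========================================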
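
The per-structure properties are enough to reassemble individual trajectories: structures that share a run_id belong to the same run, and time orders them within it. The sketch below illustrates this with plain dictionaries standing in for each structure's per-structure properties. All values are made up, and `group_into_trajectories` is a hypothetical helper, not part of load_atoms.

```python
from collections import defaultdict

# Made-up stand-ins for per-structure properties; real values come from the dataset.
records = [
    {"run_id": 7, "time": 2, "density": 1.5, "anneal_T": 2000},
    {"run_id": 3, "time": 0, "density": 3.0, "anneal_T": 3500},
    {"run_id": 7, "time": 0, "density": 1.5, "anneal_T": 2000},
    {"run_id": 7, "time": 1, "density": 1.5, "anneal_T": 2000},
    {"run_id": 3, "time": 1, "density": 3.0, "anneal_T": 3500},
]


def group_into_trajectories(records):
    """Group records by run_id and order each group by time (hypothetical helper)."""
    trajectories = defaultdict(list)
    for record in records:
        trajectories[record["run_id"]].append(record)
    return {
        run_id: sorted(frames, key=lambda r: r["time"])
        for run_id, frames in sorted(trajectories.items())
    }


trajectories = group_into_trajectories(records)
times = {run_id: [f["time"] for f in frames] for run_id, frames in trajectories.items()}
print(times)  # {3: [0, 1], 7: [0, 1, 2]}
```

With the real dataset, the same keys would presumably be read from each structure's `info` dictionary, assuming the other per-structure properties are stored the same way the importer script on this page stores `energy`.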

Miscellaneous information

C-SYNTH-23M is imported as an LmdbAtomsDataset:

Importer script for C-SYNTH-23M:

from __future__ import annotations

from pathlib import Path
from typing import Iterator

import ase.io
from ase import Atoms
from load_atoms.database.backend import BaseImporter, rename, unzip_file
from load_atoms.database.internet import FileDownload
from load_atoms.progress import Progress


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://zenodo.org/records/7704087/files/jla-gardner/carbon-data-v1.0.zip",
                expected_hash="b43fc702ef6d",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        # Unzip the file
        contents_path = unzip_file(tmp_dir / "carbon-data-v1.0.zip", progress)

        extxyz_files = sorted(contents_path.glob("**/*.extxyz"))
        task = progress.new_task(
            f"Processing {len(extxyz_files)} .extxyz files",
            total=len(extxyz_files),
        )

        # iterate through all .extxyz files
        for file_path in extxyz_files:
            structures = ase.io.read(file_path, index=":")
            assert isinstance(structures, list)
            for structure in structures:
                yield process_structure(structure)

            task.update(advance=1)


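def process_structure(structure: Atoms) -> Atoms:
    # Map the raw C-GAP-17 labels onto the documented property names.
    structure = rename(
        structure,
        {
            "gap17_forces": "forces",
            "gap17_energy": "local_energies",
        },
    )
    # The total energy is the sum of the per-atom local energies.
    structure.info["energy"] = structure.arrays["local_energies"].sum()
    return structure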
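
The effect of process_structure can be illustrated without ase or load_atoms installed. In the minimal sketch below, a plain dictionary stands in for `structure.arrays`, the per-atom values are made up, and `rename_keys` is a hypothetical stand-in that mimics the intent of `rename`; it is not the library's implementation.

```python
# Made-up per-atom data for a three-atom toy structure (not from the dataset).
arrays = {
    "gap17_forces": [[0.1, 0.0, -0.2], [0.0, 0.3, 0.0], [-0.1, -0.3, 0.2]],
    "gap17_energy": [-7.5, -8.0, -7.25],
}
info = {}


def rename_keys(mapping, renames):
    """Return a copy of mapping with keys renamed (hypothetical stand-in for rename)."""
    return {renames.get(key, key): value for key, value in mapping.items()}


arrays = rename_keys(
    arrays, {"gap17_forces": "forces", "gap17_energy": "local_energies"}
)

# Total energy as the sum of per-atom local energies, as in process_structure.
info["energy"] = sum(arrays["local_energies"])

print(sorted(arrays))  # ['forces', 'local_energies']
print(info["energy"])  # -22.75
```

The point of the sketch is that the per-structure `energy` is derived from the per-atom `local_energies` rather than read from the source files.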

DatabaseEntry for C-SYNTH-23M:

name: C-SYNTH-23M
year: 2022
description: |
    The complete "synthetic" dataset of carbon structures from `Synthetic Data Enable Experiments in Atomistic Machine Learning <https://doi.org/10.1039/D2DD00137C>`_.
    This dataset comprises 546 uncorrelated MD trajectories, each containing 200 atoms, driven by the `C-GAP-17 <https://doi.org/10.1103/PhysRevB.95.094203>`_ interatomic potential,
    and sampled every 1 ps. The structures cover a wide range of densities, temperatures and degrees of dis/order.
category: Synthetic Data
license: MIT
minimum_load_atoms_version: 0.2
format: lmdb
citation: |
    @article{Gardner-23-03,
      title = {
        Synthetic Data Enable Experiments in Atomistic Machine Learning
      },
      author = {
        Gardner, John L. A. and Beaulieu, Zo{\'e} Faure
        and Deringer, Volker L.
      },
      year = {2023},
      journal = {Digital Discovery},
      doi = {10.1039/D2DD00137C},
    }
representative_structure: 199
per_atom_properties:
    forces:
        desc: force vectors (C-GAP-17)
        units: eV/Å
    local_energies:
        desc: local energies (C-GAP-17)
        units: eV
per_structure_properties:
    energy:
        desc: total energy of the structure (C-GAP-17)
        units: eV
    anneal_T:
        desc: annealing temperature
        units: K
    density:
        desc: density of the structure
        units: g cm\ :math:`{}^{-3}`
    run_id:
        desc: unique identifier for the trajectory
    time:
        desc: timestep of the structure in the trajectory
        units: ps


# TODO: remove after Dec 2024
# backwards compatibility: unused as of 0.3.0
files:
     - url: https://zenodo.org/records/7704087/files/jla-gardner/carbon-data-v1.0.zip
       hash: b43fc702ef6d
processing:
     - UnZip
     - ForEachFile:
           pattern: "**/*.extxyz"
           steps:
               - ReadASE
     - Rename:
           gap17_forces: forces
           gap17_energy: local_energies