C-SYNTH-23M
The complete “synthetic” dataset of carbon structures from Synthetic Data Enable Experiments in Atomistic Machine Learning. This dataset comprises 546 uncorrelated MD trajectories, each containing 200 atoms, driven by the C-GAP-17 interatomic potential and sampled every 1 ps. The structures cover a wide range of densities, temperatures and degrees of dis/order.
>>> from load_atoms import load_dataset
>>> load_dataset("C-SYNTH-23M")
C-SYNTH-23M:
    structures: 115,206
    atoms: 23,041,200
    species:
        C: 100.00%
    properties:
        per atom: (forces, local_energies)
        per structure: (anneal_T, density, energy, run_id, time)
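The totals in this summary can be cross-checked against the description with a little arithmetic. The sketch below assumes that every trajectory contributes the same number of snapshots, which the source does not state explicitly; the variable names are illustrative only.

```python
# Figures quoted in the description and in the summary above.
n_trajectories = 546
atoms_per_structure = 200
n_structures = 115_206
n_atoms = 23_041_200

# Assumption: every trajectory contributes the same number of snapshots.
snapshots_per_trajectory, remainder = divmod(n_structures, n_trajectories)

print(snapshots_per_trajectory, remainder)             # 211 0
print(n_structures * atoms_per_structure == n_atoms)   # True
```

A remainder of zero is consistent with 211 snapshots per trajectory at the 1 ps sampling interval.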
License
This dataset is licensed under the MIT license.
Citation
If you use this dataset in your work, please cite the following:
@article{Gardner-23-03,
  title = {
    Synthetic Data Enable Experiments in Atomistic Machine Learning
  },
  author = {
    Gardner, John L. A. and Beaulieu, Zo{\'e} Faure
    and Deringer, Volker L.
  },
  year = {2023},
  journal = {Digital Discovery},
  doi = {10.1039/D2DD00137C},
}
Properties
Per-atom:
| Property | Units | Type | Description |
|---|---|---|---|
| forces | eV/Å |  | force vectors (C-GAP-17) |
| local_energies | eV |  | local energies (C-GAP-17) |
Per-structure:
| Property | Units | Type | Description |
|---|---|---|---|
| energy | eV |  | total energy of the structure (C-GAP-17) |
| anneal_T | K |  | annealing temperature |
| density | g cm\(^{-3}\) |  | density of the structure |
| run_id |  |  | unique identifier for the trajectory |
| time | ps |  | timestep of the structure in the trajectory |
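Because every structure holds 200 carbon atoms, the density value fixes the cell volume. The sketch below works through that conversion. The function name, the example density of 2.0 g cm\(^{-3}\) and the use of the standard atomic weight of carbon (12.011 u) are assumptions made for illustration, not values taken from the dataset.

```python
# Convert a mass density in g cm^-3 to a cell volume in Å^3 for a pure-carbon cell.
ATOMIC_MASS_UNIT_G = 1.66053906660e-24  # grams per atomic mass unit (CODATA 2018)
CARBON_MASS_U = 12.011                   # standard atomic weight of carbon (assumed)
CM3_PER_A3 = 1e-24                       # one cubic angstrom expressed in cm^3


def cell_volume_A3(density_g_cm3: float, n_atoms: int = 200) -> float:
    """Volume in Å^3 of a cell holding n_atoms carbon atoms at the given density."""
    mass_g = n_atoms * CARBON_MASS_U * ATOMIC_MASS_UNIT_G
    volume_cm3 = mass_g / density_g_cm3
    return volume_cm3 / CM3_PER_A3


# Hypothetical example density, chosen only to illustrate the conversion.
volume = cell_volume_A3(2.0)
print(f"{volume:.1f} Å^3, cube edge {volume ** (1 / 3):.2f} Å")
```

For the example density this gives a volume of roughly 1994 Å\(^3\), which corresponds to a cubic cell with an edge of about 12.6 Å.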
Miscellaneous information
C-SYNTH-23M is imported as an LmdbAtomsDataset:
Importer script for C-SYNTH-23M
from __future__ import annotations

from pathlib import Path
from typing import Iterator

import ase.io
from ase import Atoms
from load_atoms.database.backend import BaseImporter, rename, unzip_file
from load_atoms.database.internet import FileDownload
from load_atoms.progress import Progress


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://zenodo.org/records/7704087/files/jla-gardner/carbon-data-v1.0.zip",
                expected_hash="b43fc702ef6d",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        # Unzip the file
        contents_path = unzip_file(tmp_dir / "carbon-data-v1.0.zip", progress)
        extxyz_files = sorted(contents_path.glob("**/*.extxyz"))
        task = progress.new_task(
            f"Processing {len(extxyz_files)} .extxyz files",
            total=len(extxyz_files),
        )
        # iterate through all .extxyz files
        for file_path in extxyz_files:
            structures = ase.io.read(file_path, index=":")
            assert isinstance(structures, list)
            for structure in structures:
                yield process_structure(structure)
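            task.update(advance=1)


def process_structure(structure: Atoms) -> Atoms:
    # give the C-GAP-17 labels the generic names used in the property tables
    structure = rename(
        structure,
        {
            "gap17_forces": "forces",
            "gap17_energy": "local_energies",
        },
    )
    # the total energy is the sum of the per-atom local energies
    structure.info["energy"] = structure.arrays["local_energies"].sum()
    return structure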
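To make the effect of process_structure concrete, the sketch below mimics it on a plain dictionary rather than an ase.Atoms object, so it runs without ase or load_atoms installed. The rename_keys helper and the toy values are hypothetical stand-ins; the real rename helper operates on Atoms objects and its exact behaviour is not shown here.

```python
def rename_keys(arrays: dict, mapping: dict) -> dict:
    """Return a copy of `arrays` with keys renamed according to `mapping`."""
    return {mapping.get(key, key): value for key, value in arrays.items()}


# Toy per-atom data for a three-atom structure (made-up values).
raw_arrays = {
    "gap17_forces": [[0.1, 0.0, -0.1], [0.0, 0.2, 0.0], [-0.1, -0.2, 0.1]],
    "gap17_energy": [-7.5, -7.25, -7.0],
}

arrays = rename_keys(
    raw_arrays,
    {"gap17_forces": "forces", "gap17_energy": "local_energies"},
)

# The per-structure energy is the sum of the per-atom local energies.
info = {"energy": sum(arrays["local_energies"])}

print(sorted(arrays))   # ['forces', 'local_energies']
print(info["energy"])   # -21.75
```

This mirrors why the property tables list energy as a per-structure quantity while local_energies is per-atom: the former is derived from the latter at import time.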
DatabaseEntry for C-SYNTH-23M
name: C-SYNTH-23M
year: 2022
description: |
    The complete "synthetic" dataset of carbon structures from `Synthetic Data Enable Experiments in Atomistic Machine Learning <https://doi.org/10.1039/D2DD00137C>`_.
    This dataset comprises 546 uncorrelated MD trajectories, each containing 200 atoms, driven by the `C-GAP-17 <https://doi.org/10.1103/PhysRevB.95.094203>`_ interatomic potential,
    and sampled every 1 ps. The structures cover a wide range of densities, temperatures and degrees of dis/order.
category: Synthetic Data
license: MIT
minimum_load_atoms_version: 0.2
format: lmdb
citation: |
    @article{Gardner-23-03,
      title = {
        Synthetic Data Enable Experiments in Atomistic Machine Learning
      },
      author = {
        Gardner, John L. A. and Beaulieu, Zo{\'e} Faure
        and Deringer, Volker L.
      },
      year = {2023},
      journal = {Digital Discovery},
      doi = {10.1039/D2DD00137C},
    }
representative_structure: 199
per_atom_properties:
    forces:
        desc: force vectors (C-GAP-17)
        units: eV/Å
    local_energies:
        desc: local energies (C-GAP-17)
        units: eV
per_structure_properties:
    energy:
        desc: total energy of the structure (C-GAP-17)
        units: eV
    anneal_T:
        desc: annealing temperature
        units: K
    density:
        desc: density of the structure
        units: g cm\ :math:`{}^{-3}`
    run_id:
        desc: unique identifier for the trajectory
    time:
        desc: timestep of the structure in the trajectory
        units: ps
# TODO: remove after Dec 2024
# backwards compatibility: unused as of 0.3.0
files:
     - url: https://zenodo.org/records/7704087/files/jla-gardner/carbon-data-v1.0.zip
       hash: b43fc702ef6d
processing:
     - UnZip
     - ForEachFile:
           pattern: "**/*.extxyz"
           steps:
               - ReadASE
     - Rename:
           gap17_forces: forces
           gap17_energy: local_energies
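The hash entry in the files block (expected_hash in the importer) is a 12-character hexadecimal string. The source does not say how it is computed. The sketch below assumes, purely for illustration, a SHA-256 digest of the file contents truncated to 12 characters; short_checksum and matches are hypothetical helpers, not load-atoms functions.

```python
import hashlib
import tempfile
from pathlib import Path


def short_checksum(path: Path, length: int = 12) -> str:
    """Hex SHA-256 digest of a file, truncated to `length` characters.

    The choice of SHA-256 and the truncation are assumptions (see lead-in).
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in 1 MiB chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


def matches(path: Path, expected_hash: str) -> bool:
    """Check a file against an expected truncated checksum."""
    return short_checksum(path, len(expected_hash)) == expected_hash


# Demonstration on a throwaway file rather than the real archive.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"carbon")
    checksum = short_checksum(demo)
    self_check = matches(demo, checksum)

print(len(checksum), self_check)
```

Comparing a downloaded file against a stored checksum in this way guards against corrupted or silently changed archives before any structures are parsed.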