QM9

134k stable organic molecules made up of CHONF and containing up to 9 heavy atoms. Each molecule’s geometry was relaxed at the PM7 semi-empirical level of theory, before being labelled with DFT. For more information, see Quantum chemistry structures and properties of 134 kilo molecules. Internally, files are downloaded from FigShare. Energy labels are quoted in eV; U0, U, H and G are given relative to the isolated atoms of the molecule.

>>> from load_atoms import load_dataset
>>> load_dataset("QM9")
QM9:
    structures: 133,885
    atoms: 2,407,753
    species:
        H: 51.09%
        C: 35.16%
        O: 7.81%
        N: 5.80%
        F: 0.14%
    properties:
        per atom: (partial_charges)
        per structure: (A, B, C, Cv, G, H, U, U0, alpha, frequencies, gap, geometry,
            homo, inchi, index, lumo, mu, r2, smiles, zpve)
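
The atom-relative energy labels mentioned in the description can be illustrated with a small worked example. The sketch below is not the library's code path: it reuses the per-element U0 reference energies (in Hartree) listed in the importer script further down, assumes a rounded conversion factor of 27.211386 eV per Hartree, and takes an illustrative raw U0 value for methane. The helper name `atomization_energy_ev` is hypothetical.

```python
from collections import Counter

# Assumed, rounded conversion factor (eV per Hartree).
# The importer script uses ase.units for this instead.
HARTREE_TO_EV = 27.211386

# Isolated-atom U0 reference energies in Hartree,
# copied from the OFFSETS table in the importer script.
ATOMIC_U0 = {
    "H": -0.5002730,
    "C": -37.846772,
    "N": -54.583861,
    "O": -75.064579,
    "F": -99.718730,
}


def atomization_energy_ev(symbols, total_u0_hartree):
    """Subtract the isolated-atom energies from a total energy, then convert to eV."""
    counts = Counter(symbols)
    reference = sum(ATOMIC_U0[symbol] * n for symbol, n in counts.items())
    return (total_u0_hartree - reference) * HARTREE_TO_EV


# Illustrative input: methane (CH4) with an assumed raw U0 of -40.47893 Hartree.
methane_u0 = atomization_energy_ev(["C", "H", "H", "H", "H"], -40.47893)
print(f"{methane_u0:.3f} eV")
```

Under these assumptions the result is about -17.17 eV: the molecule is lower in energy than its separated atoms, so the label is negative.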

License

This dataset is licensed under the CC0 license.

Citation

If you use this dataset in your work, please cite the following:

@article{Ramakrishnan-17,
   author={
        Ramakrishnan, Raghunathan and Dral, Pavlo O and
        Rupp, Matthias and von Lilienfeld, O Anatole
   },
   title = {Data for 133885 GDB-9 molecules},
   year = {2017},
   month = {6},
   doi = {10.6084/m9.figshare.978904_D12}
}
@article{Ramakrishnan-14,
    title={
        Quantum chemistry structures and properties of 134 kilo molecules
    },
    author={
        Ramakrishnan, Raghunathan and Dral, Pavlo O and
        Rupp, Matthias and von Lilienfeld, O Anatole
    },
    journal={Scientific Data},
    volume={1},
    year={2014},
    publisher={Nature Publishing Group}
}
@article{Ruddigkeit-12,
    title = {
        Enumeration of 166 {{Billion Organic Small Molecules}}
        in the {{Chemical Universe Database GDB-17}}
    },
    author = {
        Ruddigkeit, Lars and {van Deursen}, Ruud and
        Blum, Lorenz C. and Reymond, Jean-Louis
    },
    year = {2012},
    journal = {Journal of Chemical Information and Modeling},
    volume = {52},
    number = {11},
    pages = {2864--2875},
    doi = {10.1021/ci300415d},
}

Properties

Per-atom:

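===============  =====  ===========  ===============================
Property         Units  Type         Description
===============  =====  ===========  ===============================
partial_charges  e      ndarray(N,)  Mulliken partial atomic charges
===============  =====  ===========  ===============================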

Per-structure:

===========  ===========================  =====  =============================
Property     Units                        Type   Description
===========  ===========================  =====  =============================
index                                     float  consecutive index of molecule
A            GHz                          float  Rotational constant A
B            GHz                          float  Rotational constant B
C            GHz                          float  Rotational constant C
mu           Debye                        float  Dipole moment
alpha        Bohr\(^3\)                   float  Isotropic polarizability
homo         eV                           float  HOMO energy
lumo         eV                           float  LUMO energy
gap          eV                           float  HOMO-LUMO energy gap
r2           Bohr\(^2\)                   float  electronic spatial extent
zpve         eV                           float  zero point vibrational energy
U0           eV                           float  internal energy at 0 K
U            eV                           float  internal energy at 298.15 K
H            eV                           float  enthalpy at 298.15 K
G            eV                           float  free energy at 298.15 K
Cv           cal mol\(^{-1}\) K\(^{-1}\)  float  heat capacity at 298.15 K
frequencies  cm\(^{-1}\)                  list   harmonic frequencies
geometry                                  bool   final geometry check passed
smiles                                    str    SMILES string
inchi                                     str    InChI identifier
===========  ===========================  =====  =============================
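
Several of the per-structure properties are stored in units other than eV and Å. As a rough illustration of converting them, the sketch below applies standard conversion factors (the thermochemical calorie, 4.184 J, and a rounded Bohr radius of 0.529177 Å) to a dictionary shaped like a structure's property mapping. The function names and the input values are hypothetical and not part of load-atoms.

```python
CAL_TO_JOULE = 4.184         # thermochemical calorie, exact by definition
BOHR_TO_ANGSTROM = 0.529177  # Bohr radius in Angstrom, rounded


def cv_to_joule(cv_cal):
    """Heat capacity: cal mol^-1 K^-1 -> J mol^-1 K^-1."""
    return cv_cal * CAL_TO_JOULE


def alpha_to_angstrom3(alpha_bohr3):
    """Isotropic polarizability: Bohr^3 -> Angstrom^3."""
    return alpha_bohr3 * BOHR_TO_ANGSTROM**3


def r2_to_angstrom2(r2_bohr2):
    """Electronic spatial extent: Bohr^2 -> Angstrom^2."""
    return r2_bohr2 * BOHR_TO_ANGSTROM**2


# Made-up values, used only to exercise the conversions.
info = {"Cv": 6.0, "alpha": 10.0, "r2": 40.0}

converted = {
    "Cv": cv_to_joule(info["Cv"]),
    "alpha": alpha_to_angstrom3(info["alpha"]),
    "r2": r2_to_angstrom2(info["r2"]),
}
for name, value in converted.items():
    print(f"{name}: {value:.4f}")
```

Keeping each conversion in its own function makes the source and target units explicit at the call site.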

Miscellaneous information

QM9 is imported as an InMemoryAtomsDataset:

Importer script for QM9

from __future__ import annotations

from io import StringIO
from pathlib import Path
from typing import Iterator

from ase import Atoms
from ase.io import read
from ase.units import Hartree, eV
from load_atoms.database.backend import BaseImporter, unzip_file
from load_atoms.database.internet import FileDownload
from load_atoms.progress import Progress


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://figshare.com/ndownloader/files/3195389",
                expected_hash="3a63848ac806",
                local_name="qm9.tar.bz2",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        # Unzip the file
        contents_path = unzip_file(tmp_dir / "qm9.tar.bz2", progress)

        # Process each XYZ file
        xyz_files = sorted(contents_path.glob("*.xyz"))
        total_files = len(xyz_files)

        with progress.new_task(
            "Processing QM9 structures", total=total_files
        ) as task:
            for xyz_file in xyz_files:
                yield read_qm9(xyz_file)
                task.update(advance=1)


PROPERTY_KEYS = "index A B C mu alpha homo lumo gap r2 zpve U0 U H G Cv".split()
assert len(PROPERTY_KEYS) == 16
RESCALINGS = [1.0] * 16
for key in "homo lumo gap zpve U0 U H G".split():
    # convert from Hartree to eV
    RESCALINGS[PROPERTY_KEYS.index(key)] = Hartree / eV

# taken from https://figshare.com/ndownloader/files/3195395
# isolated-atom reference energies in Hartree, ordered as in OFFSET_COLUMNS
OFFSET_COLUMNS = "U0 U H G".split()
OFFSETS = {
    "H": [-0.5002730, -0.4988570, -0.4979120, -0.5109270],
    "C": [-37.846772, -37.845355, -37.844411, -37.861317],
    "N": [-54.583861, -54.582445, -54.581501, -54.598897],
    "O": [-75.064579, -75.063163, -75.062219, -75.079532],
    "F": [-99.718730, -99.717314, -99.716370, -99.733544],
}

# taken from https://figshare.com/ndownloader/files/3195404
# stored as a set so that the per-structure membership test is O(1)
BAD_IDS = {
    int(line)
    for line in (Path(__file__).parent.resolve() / "bad_qm9.txt")
    .read_text()
    .splitlines()
}


def read_qm9(file: Path) -> Atoms:
    """
    Read in a single XYZ file from the QM9 dataset, and return it
    as a single Atoms object.

    See the original README for a specification of the format used:
    https://figshare.com/files/3195392
    """
    (
        n,
        property_values,
        *content,
        frequencies,
        smiles,
        inchi,
    ) = file.read_text().replace("*^", "e").splitlines()

    # fake an extxyz file and get ase to read it
    header = 'Properties=species:S:1:pos:R:3:partial_charges:R:1 pbc="F F F"'
    with StringIO("\n".join([n, header, *content])) as extxyz:
        atoms: Atoms = read(extxyz, 0, format="extxyz")  # type: ignore

    # ignore first "gdb" property
    property_values = [float(v) for v in property_values.split()[1:]]
    property_values[0] = int(property_values[0])

    assert len(property_values) == 16
    properties: dict = dict(zip(PROPERTY_KEYS, property_values))

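    for name in properties:
        if name in OFFSET_COLUMNS:
            # subtract each atom's isolated-atom reference energy (Hartree),
            # so that the label is relative to the isolated atoms
            for atom in atoms:
                properties[name] -= OFFSETS[atom.symbol][  # type: ignore
                    OFFSET_COLUMNS.index(name)
                ]
        # rescale to the documented units (Hartree to eV where applicable)
        properties[name] *= RESCALINGS[PROPERTY_KEYS.index(name)]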

    properties["frequencies"] = list(map(float, frequencies.split()))

    # molecule characterisation
    properties["smiles"] = smiles.split()[-1]
    properties["inchi"] = inchi.split()[-1]
    properties["geometry"] = property_values[0] not in BAD_IDS

    atoms.info = properties

    return atoms
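
The unpacking at the start of read_qm9 relies on the layout of each raw file: an atom count, a property line, one line per atom, then the frequencies, SMILES and InChI lines. The sketch below reproduces just that step on a synthetic record, using only the standard library. The numbers are invented for illustration and are not taken from QM9, `split_qm9_record` is a hypothetical helper, and ASE is deliberately left out.

```python
PROPERTY_KEYS = "index A B C mu alpha homo lumo gap r2 zpve U0 U H G Cv".split()

# Synthetic record in the raw QM9 layout (values are made up).
SYNTHETIC_RECORD = """\
2
gdb 7\t1.0\t2.0\t3.0\t0.5\t4.0\t-0.3\t0.1\t0.4\t20.0\t0.01\t-1.5\t-1.4\t-1.3\t-1.6\t5.0
H\t0.0\t0.0\t0.0\t1.5*^-6
H\t0.0\t0.0\t0.74\t-1.5*^-6
4400.0
[H][H]\t[H][H]
InChI=1S/H2/h1H\tInChI=1S/H2/h1H
"""


def split_qm9_record(text):
    """Split one raw record into properties, atoms and identifiers."""
    # Exponents written as 1.5*^-6 are rewritten to 1.5e-6 before parsing.
    n, property_line, *atom_lines, frequencies, smiles, inchi = (
        text.replace("*^", "e").splitlines()
    )

    # Drop the leading "gdb" tag, keep the 16 numeric values.
    values = [float(v) for v in property_line.split()[1:]]

    atoms = []
    for line in atom_lines:
        symbol, x, y, z, charge = line.split()
        atoms.append((symbol, (float(x), float(y), float(z)), float(charge)))
    if len(atoms) != int(n):
        raise ValueError("atom count does not match the header line")

    return {
        "properties": dict(zip(PROPERTY_KEYS, values)),
        "atoms": atoms,
        "frequencies": [float(f) for f in frequencies.split()],
        # As in read_qm9, the last entry on each identifier line is kept.
        "smiles": smiles.split()[-1],
        "inchi": inchi.split()[-1],
    }


record = split_qm9_record(SYNTHETIC_RECORD)
print(record["properties"]["U0"], record["atoms"][0])
```

The starred target in the unpacking absorbs however many atom lines the record contains, which is why the three trailing lines can be named directly.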

DatabaseEntry for QM9

name: QM9
year: 2014
license: CC0
category: Benchmarks
description: |
    134k stable organic molecules made up of CHONF and containing up to 9 heavy atoms.
    Each molecule's geometry was relaxed at the PM7 semi-empirical level of theory, before
    being labelled with DFT.
    For more information, see `Quantum chemistry structures and properties of
    134 kilo molecules <https://doi.org/10.1038/sdata.2014.22>`_.
    Internally, files are downloaded from `FigShare <https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904>`_.
    Energy labels are quoted in eV; U0, U, H and G are given relative to the isolated atoms of the molecule.
minimum_load_atoms_version: 0.2
representative_structure: 23810
citation: |
    @article{Ramakrishnan-17,
       author={
            Ramakrishnan, Raghunathan and Dral, Pavlo O and
            Rupp, Matthias and von Lilienfeld, O Anatole
       },
       title = {Data for 133885 GDB-9 molecules},
       year = {2017},
       month = {6},
       doi = {10.6084/m9.figshare.978904_D12}
    }
    @article{Ramakrishnan-14,
        title={
            Quantum chemistry structures and properties of 134 kilo molecules
        },
        author={
            Ramakrishnan, Raghunathan and Dral, Pavlo O and
            Rupp, Matthias and von Lilienfeld, O Anatole
        },
        journal={Scientific Data},
        volume={1},
        year={2014},
        publisher={Nature Publishing Group}
    }
    @article{Ruddigkeit-12,
        title = {
            Enumeration of 166 {{Billion Organic Small Molecules}}
            in the {{Chemical Universe Database GDB-17}}
        },
        author = {
            Ruddigkeit, Lars and {van Deursen}, Ruud and
            Blum, Lorenz C. and Reymond, Jean-Louis
        },
        year = {2012},
        journal = {Journal of Chemical Information and Modeling},
        volume = {52},
        number = {11},
        pages = {2864--2875},
        doi = {10.1021/ci300415d},
    }
per_atom_properties:
    partial_charges:
        desc: Mulliken partial atomic charges
        units: e
per_structure_properties:
    index:
        desc: consecutive index of molecule
    A:
        desc: Rotational constant A
        units: GHz
    B:
        desc: Rotational constant B
        units: GHz
    C:
        desc: Rotational constant C
        units: GHz
    mu:
        desc: Dipole moment
        units: Debye
    alpha:
        desc: Isotropic polarizability
        units: Bohr\ :math:`^3`
    homo:
        desc: HOMO energy
        units: eV
    lumo:
        desc: LUMO energy
        units: eV
    gap:
        desc: HOMO-LUMO energy gap
        units: eV
    r2:
        desc: electronic spatial extent
        units: Bohr\ :math:`^2`
    zpve:
        desc: zero point vibrational energy
        units: eV
    U0:
        desc: internal energy at 0 K
        units: eV
    U:
        desc: internal energy at 298.15 K
        units: eV
    H:
        desc: enthalpy at 298.15 K
        units: eV
    G:
        desc: free energy at 298.15 K
        units: eV
    Cv:
        desc: heat capacity at 298.15 K
        units: "cal mol\ :math:`^{-1}` K\ :math:`^{-1}`"
    frequencies:
        desc: harmonic frequencies
        units: cm\ :math:`^{-1}`
    geometry:
        desc: final geometry check passed
    smiles:
        desc: "`SMILES <https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system>`_ string"
    inchi:
        desc: "`InChI <https://en.wikipedia.org/wiki/International_Chemical_Identifier>`_ identifier"


# TODO: remove after Dec 2024
# backwards compatibility: unused as of 0.3.0
files:
    - url: https://figshare.com/ndownloader/files/3195389
      name: dsgdb9nsd.xyz.tar.bz2
      hash: 3a63848ac806
processing:
    - UnZip
    - ForEachFile:
          pattern: "**/*.xyz"
          steps:
              - Custom:
                    id: read_qm9_xyz