QM9
134k stable organic molecules made up of CHONF and containing up to 9 heavy atoms. Each molecule’s geometry was relaxed at the PM7 semi-empirical level of theory, before being labelled with DFT. For more information, see Quantum chemistry structures and properties of 134 kilo molecules. Internally, files are downloaded from FigShare. Energy labels are quoted in eV, relative to the isolated atoms of the molecule.
>>> from load_atoms import load_dataset
>>> load_dataset("QM9")
QM9:
    structures: 133,885
    atoms: 2,407,753
    species:
        H: 51.09%
        C: 35.16%
        O: 7.81%
        N: 5.80%
        F: 0.14%
    properties:
        per atom: (partial_charges)
        per structure: (A, B, C, Cv, G, H, U, U0, alpha, frequencies, gap, geometry,
            homo, inchi, index, lumo, mu, r2, smiles, zpve)
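The energy convention stated in the description (eV, relative to the isolated atoms of the molecule) can be sketched in a few lines. This is a minimal illustration, not the library's code path: the atomic reference values for U0 are copied from the importer script shown later on this page, while the function name `atomization_energy_ev`, the rounded conversion factor and the methane-like input energy are assumptions made for the example.

```python
from __future__ import annotations

# Atomic reference energies for U0, in Hartree, copied from the
# OFFSETS table in the importer script on this page.
ATOMIC_U0_HARTREE = {
    "H": -0.5002730,
    "C": -37.846772,
    "N": -54.583861,
    "O": -75.064579,
    "F": -99.718730,
}

# Assumption: a rounded Hartree-to-eV factor. The importer itself uses
# ase.units.Hartree, whose exact value depends on ASE's CODATA version.
HARTREE_TO_EV = 27.211386


def atomization_energy_ev(symbols: list[str], total_energy_hartree: float) -> float:
    """Subtract the isolated-atom energies, then convert Hartree to eV."""
    reference = sum(ATOMIC_U0_HARTREE[s] for s in symbols)
    return (total_energy_hartree - reference) * HARTREE_TO_EV


# Illustrative input only: a methane-like composition with an assumed
# total energy in Hartree (not read from the dataset here).
methane = ["C", "H", "H", "H", "H"]
energy = atomization_energy_ev(methane, -40.47893)
print(f"U0 relative to isolated atoms: {energy:.3f} eV")
```

A bound molecule gives a negative value under this convention, because its total energy lies below the sum of its isolated atoms.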
License
This dataset is licensed under the CC0 license.
Citation
If you use this dataset in your work, please cite the following:
@article{Ramakrishnan-17,
    author = {
        Ramakrishnan, Raghunathan and Dral, Pavlo O and
        Rupp, Matthias and von Lilienfeld, O Anatole
    },
    title = {Data for 133885 GDB-9 molecules},
    year = {2017},
    month = {6},
    doi = {10.6084/m9.figshare.978904_D12}
}
@article{Ramakrishnan-14,
    title = {
        Quantum chemistry structures and properties of 134 kilo molecules
    },
    author = {
        Ramakrishnan, Raghunathan and Dral, Pavlo O and
        Rupp, Matthias and von Lilienfeld, O Anatole
    },
    journal = {Scientific Data},
    volume = {1},
    year = {2014},
    publisher = {Nature Publishing Group}
}
@article{Ruddigkeit-12,
    title = {
        Enumeration of 166 {{Billion Organic Small Molecules}}
        in the {{Chemical Universe Database GDB-17}}
    },
    author = {
        Ruddigkeit, Lars and {van Deursen}, Ruud and
        Blum, Lorenz C. and Reymond, Jean-Louis
    },
    year = {2012},
    journal = {Journal of Chemical Information and Modeling},
    volume = {52},
    number = {11},
    pages = {2864--2875},
    doi = {10.1021/ci300415d},
}
Properties
Per-atom:
Property | Units | Type | Description |
---|---|---|---|
partial_charges | e | | Mulliken partial atomic charges |
Per-structure:
Property | Units | Type | Description |
---|---|---|---|
index | | | consecutive index of molecule |
A | GHz | | Rotational constant A |
B | GHz | | Rotational constant B |
C | GHz | | Rotational constant C |
mu | Debye | | Dipole moment |
alpha | Bohr\(^3\) | | Isotropic polarizability |
homo | eV | | HOMO energy |
lumo | eV | | LUMO energy |
gap | eV | | HOMO-LUMO energy gap |
r2 | Bohr\(^2\) | | electronic spatial extent |
zpve | eV | | zero point vibrational energy |
U0 | eV | | internal energy at 0 K |
U | eV | | internal energy at 298.15 K |
H | eV | | enthalpy at 298.15 K |
G | eV | | free energy at 298.15 K |
Cv | cal mol\(^{-1}\) K\(^{-1}\) | | heat capacity at 298.15 K |
frequencies | cm\(^{-1}\) | | harmonic frequencies |
geometry | | | final geometry check passed |
smiles | | | SMILES string |
inchi | | | InChI identifier |
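As a rough sketch of how the per-structure properties listed above might be used, the snippet below filters on the `geometry` flag and cross-checks `gap` against `lumo - homo`. The records are plain dictionaries standing in for the `info` mapping that the importer script attaches to each structure, and every numeric value is made up for illustration.

```python
# Hypothetical records mimicking the per-structure properties of QM9.
# All numeric values are invented for this example, not taken from the dataset.
records = [
    {"index": 1, "homo": -10.5, "lumo": 3.2, "gap": 13.7, "geometry": True},
    {"index": 2, "homo": -7.0, "lumo": 2.0, "gap": 9.0, "geometry": False},
    {"index": 3, "homo": -6.8, "lumo": 0.7, "gap": 7.5, "geometry": True},
]

# Keep only structures whose final geometry check passed.
consistent = [r for r in records if r["geometry"]]

# The HOMO-LUMO gap is the LUMO energy minus the HOMO energy,
# so the three columns can be checked against each other.
for r in consistent:
    assert abs((r["lumo"] - r["homo"]) - r["gap"]) < 1e-6

mean_gap = sum(r["gap"] for r in consistent) / len(consistent)
print(f"{len(consistent)} structures kept, mean gap {mean_gap:.2f} eV")
```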
Miscellaneous information
QM9 is imported as an InMemoryAtomsDataset:
Importer script for QM9
from __future__ import annotations

from io import StringIO
from pathlib import Path
from typing import Iterator

from ase import Atoms
from ase.io import read
from ase.units import Hartree, eV
from load_atoms.database.backend import BaseImporter, unzip_file
from load_atoms.database.internet import FileDownload
from load_atoms.progress import Progress


class Importer(BaseImporter):
    @classmethod
    def files_to_download(cls) -> list[FileDownload]:
        return [
            FileDownload(
                url="https://figshare.com/ndownloader/files/3195389",
                expected_hash="3a63848ac806",
                local_name="qm9.tar.bz2",
            )
        ]

    @classmethod
    def get_structures(
        cls, tmp_dir: Path, progress: Progress
    ) -> Iterator[Atoms]:
        # Unzip the file
        contents_path = unzip_file(tmp_dir / "qm9.tar.bz2", progress)

        # Process each XYZ file
        xyz_files = sorted(contents_path.glob("*.xyz"))
        total_files = len(xyz_files)
        with progress.new_task(
            "Processing QM9 structures", total=total_files
        ) as task:
            for xyz_file in xyz_files:
                yield read_qm9(xyz_file)
                task.update(advance=1)


# order of the values on the second line of each QM9 XYZ file
PROPERTY_KEYS = "index A B C mu alpha homo lumo gap r2 zpve U0 U H G Cv".split()
assert len(PROPERTY_KEYS) == 16

RESCALINGS = [1.0] * 16
for property in "homo lumo gap zpve U0 U H G".split():
    # convert from Hartree to eV
    RESCALINGS[PROPERTY_KEYS.index(property)] = Hartree / eV

# atomic reference energies in Hartree, one value per entry of OFFSET_COLUMNS
# taken from https://figshare.com/ndownloader/files/3195395
OFFSET_COLUMNS = "U0 U H G".split()
OFFSETS = {
    "H": [-0.5002730, -0.4988570, -0.4979120, -0.5109270],
    "C": [-37.846772, -37.845355, -37.844411, -37.861317],
    "N": [-54.583861, -54.582445, -54.581501, -54.598897],
    "O": [-75.064579, -75.063163, -75.062219, -75.079532],
    "F": [-99.718730, -99.717314, -99.716370, -99.733544],
}

# taken from https://figshare.com/ndownloader/files/3195404
BAD_IDS = [
    int(id)
    for id in (Path(__file__).parent.resolve() / "bad_qm9.txt")
    .read_text()
    .splitlines()
]


def read_qm9(file: Path) -> Atoms:
    """
    Read in a single XYZ file from the QM9 dataset, and return it
    as a single Atoms object.

    See the original README for a specification of the format used:
    https://figshare.com/files/3195392
    """
    # "*^" is the exponent marker used for some floats in the raw files
    (
        n,
        property_values,
        *content,
        frequencies,
        smiles,
        inchi,
    ) = file.read_text().replace("*^", "e").splitlines()

    # fake an extxyz file and get ase to read it
    header = 'Properties=species:S:1:pos:R:3:partial_charges:R:1 pbc="F F F"'
    with StringIO("\n".join([n, header, *content])) as extxyz:
        atoms: Atoms = read(extxyz, 0, format="extxyz")  # type: ignore

    # ignore first "gdb" property
    property_values = [float(v) for v in property_values.split()[1:]]
    property_values[0] = int(property_values[0])
    assert len(property_values) == 16
    properties: dict = dict(zip(PROPERTY_KEYS, property_values))

    for name in properties:
        if name in OFFSET_COLUMNS:
            # subtract the isolated-atom reference energy of every atom
            for atom in atoms:
                properties[name] -= OFFSETS[atom.symbol][  # type: ignore
                    OFFSET_COLUMNS.index(name)
                ]
        # convert Hartree to eV where applicable (the factor is 1.0 otherwise)
        properties[name] *= RESCALINGS[PROPERTY_KEYS.index(name)]

    properties["frequencies"] = list(map(float, frequencies.split()))

    # molecule characterisation
    properties["smiles"] = smiles.split()[-1]
    properties["inchi"] = inchi.split()[-1]
    properties["geometry"] = property_values[0] not in BAD_IDS

    atoms.info = properties
    return atoms
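
The line layout that `read_qm9` relies on can be tried out without ASE or the real archive. The sketch below uses only the standard library and a synthetic file body: the molecule, its coordinates and every number in it are invented, and the property line is shortened to four values for brevity (the real files carry sixteen after the `gdb` tag).

```python
# Synthetic text laid out like a QM9 XYZ file; every value is made up.
text = "\n".join([
    "2",                                   # number of atoms
    "gdb 7\t1.0\t2.0\t3.0",                # tag, index, then scalar properties
    "H\t0.0\t0.0\t0.0\t0.1",               # symbol, x, y, z, partial charge
    "H\t0.0\t0.0\t7.4*^-1\t-0.1",
    "4401.2",                              # harmonic frequencies
    "[H][H]\t[H][H]",                      # two SMILES strings
    "InChI=1S/H2/h1H\tInChI=1S/H2/h1H",    # two InChI strings
])

# Some floats use "*^" as the exponent marker; swap in "e" so float() works.
lines = text.replace("*^", "e").splitlines()

# Star-unpacking: two header lines, a variable number of atom lines,
# then three trailer lines.
n, property_line, *atom_lines, frequencies, smiles, inchi = lines
assert int(n) == len(atom_lines)

# Drop the leading "gdb" tag; the first remaining value is the molecule index.
values = property_line.split()[1:]
index = int(values[0])

# Parse each atom line into (symbol, x, y, z, partial charge).
atoms = [
    (s, float(x), float(y), float(z), float(q))
    for s, x, y, z, q in (line.split() for line in atom_lines)
]

print(index, atoms[1][3], smiles.split()[-1], inchi.split()[-1])
```

Taking the last whitespace-separated token of the SMILES and InChI lines mirrors what the importer does with `smiles.split()[-1]` and `inchi.split()[-1]`.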
DatabaseEntry for QM9
name: QM9
year: 2014
license: CC0
category: Benchmarks
description: |
    134k stable organic molecules made up of CHONF and containing up to 9 heavy atoms.
    Each molecule's geometry was relaxed at the PM7 semi-empirical level of theory, before
    being labelled with DFT.
    For more information, see `Quantum chemistry structures and properties of
    134 kilo molecules <https://doi.org/10.1038/sdata.2014.22>`_.
    Internally, files are downloaded from `FigShare <https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904>`_.
    Energy labels are quoted in eV, relative to the isolated atoms of the molecule.
minimum_load_atoms_version: 0.2
representative_structure: 23810
citation: |
    @article{Ramakrishnan-17,
        author={
            Ramakrishnan, Raghunathan and Dral, Pavlo O and
            Rupp, Matthias and von Lilienfeld, O Anatole
        },
        title = {Data for 133885 GDB-9 molecules},
        year = {2017},
        month = {6},
        doi = {10.6084/m9.figshare.978904_D12}
    }
    @article{Ramakrishnan-14,
        title={
            Quantum chemistry structures and properties of 134 kilo molecules
        },
        author={
            Ramakrishnan, Raghunathan and Dral, Pavlo O and
            Rupp, Matthias and von Lilienfeld, O Anatole
        },
        journal={Scientific Data},
        volume={1},
        year={2014},
        publisher={Nature Publishing Group}
    }
    @article{Ruddigkeit-12,
        title = {
            Enumeration of 166 {{Billion Organic Small Molecules}}
            in the {{Chemical Universe Database GDB-17}}
        },
        author = {
            Ruddigkeit, Lars and {van Deursen}, Ruud and
            Blum, Lorenz C. and Reymond, Jean-Louis
        },
        year = {2012},
        journal = {Journal of Chemical Information and Modeling},
        volume = {52},
        number = {11},
        pages = {2864--2875},
        doi = {10.1021/ci300415d},
    }
per_atom_properties:
    partial_charges:
        desc: Mulliken partial atomic charges
        units: e
per_structure_properties:
    index:
        desc: consecutive index of molecule
    A:
        desc: Rotational constant A
        units: GHz
    B:
        desc: Rotational constant B
        units: GHz
    C:
        desc: Rotational constant C
        units: GHz
    mu:
        desc: Dipole moment
        units: Debye
    alpha:
        desc: Isotropic polarizability
        units: Bohr\ :math:`^3`
    homo:
        desc: HOMO energy
        units: eV
    lumo:
        desc: LUMO energy
        units: eV
    gap:
        desc: HOMO-LUMO energy gap
        units: eV
    r2:
        desc: electronic spatial extent
        units: Bohr\ :math:`^2`
    zpve:
        desc: zero point vibrational energy
        units: eV
    U0:
        desc: internal energy at 0 K
        units: eV
    U:
        desc: internal energy at 298.15 K
        units: eV
    H:
        desc: enthalpy at 298.15 K
        units: eV
    G:
        desc: free energy at 298.15 K
        units: eV
    Cv:
        desc: heat capacity at 298.15 K
        units: "cal mol\ :math:`^{-1}` K\ :math:`^{-1}`"
    frequencies:
        desc: harmonic frequencies
        units: cm\ :math:`^{-1}`
    geometry:
        desc: final geometry check passed
    smiles:
        desc: "`SMILES <https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system>`_ string"
    inchi:
        desc: "`InChI <https://en.wikipedia.org/wiki/International_Chemical_Identifier>`_ identifier"
# TODO: remove after Dec 2024
# backwards compatibility: unused as of 0.3.0
files:
    - url: https://figshare.com/ndownloader/files/3195389
      name: dsgdb9nsd.xyz.tar.bz2
      hash: 3a63848ac806
processing:
    - UnZip
    - ForEachFile:
          pattern: "**/*.xyz"
          steps:
              - Custom:
                    id: read_qm9_xyz