graph-pes-train
graph-pes-train is a command line tool for training graph-based potential energy surface models using PyTorch Lightning:
$ graph-pes-train -h
usage: graph-pes-train [-h] [args [args ...]]
Train a GraphPES model using PyTorch Lightning.
positional arguments:
args Config files and command line specifications.
Config files should be YAML (.yaml/.yml) files.
Command line specifications should be in the form
nested/key=value. Final config is built up from
these items in a left to right manner, with later
items taking precedence over earlier ones in the
case of conflicts.
optional arguments:
-h, --help show this help message and exit
Copyright 2023-24, John Gardner
For a hands-on introduction, try our quickstart Colab notebook. Alternatively, you can learn how to use graph-pes-train from the basics guide, the complete configuration documentation, or a set of examples.
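As a quick illustration of the left-to-right precedence described in the help text, the following invocation builds the final config from base.yaml and then overrides a single nested value from the command line. The file name and key path shown here are purely illustrative, not keys that graph-pes requires:

$ graph-pes-train base.yaml fitting/trainer_kwargs/max_epochs=100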
There are a few important things to note when using graph-pes-train in special situations:
Multi-GPU training
The graph-pes-train command supports multi-GPU training out of the box, relying on PyTorch Lightning's native support for distributed training.
By default, graph-pes-train will attempt to use all available GPUs. You can override this by exporting the CUDA_VISIBLE_DEVICES environment variable:
$ export CUDA_VISIBLE_DEVICES=0,1
$ graph-pes-train config.yaml
Non-interactive jobs
In cases where you are running graph-pes-train in a non-interactive session (e.g. from a script or a scheduled job) and you wish to make use of the Weights and Biases logging functionality, you will need to take one of the following steps:
- run wandb login in an interactive session beforehand - this will cache your credentials to ~/.netrc
- set the WANDB_API_KEY environment variable to your W&B API key directly before running graph-pes-train
Failing to do this will result in graph-pes-train hanging forever while waiting for you to log in to your W&B account.
Alternatively, you can set wandb: null in your config file to disable W&B logging entirely.
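For concreteness, here is a minimal sketch of the two login options described above (the API key value is a placeholder for your own key):

$ # option 1: log in once interactively; credentials are cached to ~/.netrc
$ wandb login

$ # option 2: provide the key via the environment in the non-interactive job
$ export WANDB_API_KEY=<your-api-key>
$ graph-pes-train config.yaml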
Compute clusters
If you are running graph-pes-train on a compute cluster as a scheduled job, ensure that you:
- use a "logged" progress bar so that you can monitor the progress of your training run directly from the job's outputs
- correctly set the CUDA_VISIBLE_DEVICES environment variable so that graph-pes-train makes use of all the GPUs you have requested (and no others) (see above)
- consider copying across your data to the worker nodes, and running graph-pes-train from there rather than on the head node (see the sketch after this list):
  - graph-pes-train writes checkpoints semi-frequently to disk, and this may cause issues/throttle the cluster's network
  - if you are using a disk-backed dataset (for instance reading from an ase .db file), each data point access will require an I/O operation, and reading from local file storage on the worker nodes will be many times faster than over the network
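As an illustrative sketch only, a scheduled job script following the advice above might look like the following. The SLURM directives and the /scratch path are assumptions about your particular cluster, not something graph-pes provides or requires:

#!/bin/bash
#SBATCH --gres=gpu:2              # assumed SLURM syntax: request two GPUs
#SBATCH --output=train-%j.out     # job output file from which to monitor progress

# copy the (disk-backed) dataset to fast node-local storage before training
# (/scratch/$USER is a hypothetical node-local directory on your cluster)
cp /shared/data/train.db /scratch/$USER/
cd /scratch/$USER

# expose exactly the GPUs requested for this job (and no others)
export CUDA_VISIBLE_DEVICES=0,1

# run training from the worker node; the config should reference the
# node-local copy of the dataset rather than the networked path
graph-pes-train /shared/configs/config.yaml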