## Additional Best Practices
### Use Miniconda

It's usually unnecessary to install the full Anaconda distribution; Miniconda should be enough (it weighs around 80 MB). A big advantage of conda is that it installs precompiled binaries, so packages can be installed without requiring particular compilers or libraries to be available on the system. This often makes it easier to install dependencies such as `cudatoolkit` for GPU support. Conda also makes your environments accessible globally, which might be more convenient than creating a new local environment for every project.

Example installation:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda update -n base -c defaults conda
conda create -n myenv python=3.10
conda activate myenv
```
### Set private environment variables in .env file

System-specific variables (e.g. absolute paths to datasets) should not be under version control, or they will cause conflicts between different users. Your private keys also shouldn't be versioned, since you don't want them to be leaked.

The template contains a `.env.example` file, which serves as an example. Create a new file called `.env` (this name is excluded from version control in `.gitignore`) and use it for storing environment variables like this:

```bash
MY_VAR=/home/user/my_system_path
```

Such variables can then be referenced in config files through the `oc.env` resolver:

```yaml
path_to_data: ${oc.env:MY_VAR}
```
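Loaders such as the `python-dotenv` package read these key-value pairs into the process environment before the config is composed. A minimal standalone sketch of that behavior (illustrative only, not the template's actual loading code; `load_env` is a hypothetical helper):

```python
import os

# Hypothetical minimal loader: reads KEY=VALUE lines from a .env file
# and exports them into the environment, skipping blanks and comments.
def load_env(path: str = ".env") -> None:
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())
```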
### Name metrics using '/' character

Depending on which logger you're using, it's often useful to define metric names with the `/` character:

```python
self.log("train/loss", loss)
```
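The reason this helps: loggers such as TensorBoard treat the text before `/` as a section name and group related plots together in the UI. A standalone sketch of that grouping idea (illustrative only; the metric names are placeholders):

```python
from collections import defaultdict

# Group metric names by their "/" prefix, the way logger dashboards do.
metrics = ["train/loss", "train/acc", "val/loss", "val/acc"]
sections = defaultdict(list)
for name in metrics:
    section, _, metric = name.partition("/")
    sections[section].append(metric)

print(dict(sections))  # {'train': ['loss', 'acc'], 'val': ['loss', 'acc']}
```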
### Use torchmetrics

Use the official [torchmetrics](https://github.com/PytorchLightning/metrics) library to ensure proper calculation of metrics. This is especially important for multi-GPU training! For example, instead of calculating accuracy by yourself, use the provided `Accuracy` class like this:

```python
from torchmetrics.classification.accuracy import Accuracy


class LitModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.train_acc = Accuracy()
        self.val_acc = Accuracy()

    def training_step(self, batch, batch_idx):
        ...
        acc = self.train_acc(predictions, targets)
        self.log("train/acc", acc)
        ...

    def validation_step(self, batch, batch_idx):
        ...
        acc = self.val_acc(predictions, targets)
        self.log("val/acc", acc)
        ...
```
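To see why a dedicated metric object matters, note that accuracy must be accumulated over batches (and, on multiple GPUs, over processes) rather than averaged per batch. A toy single-process illustration of the stateful update/compute pattern (this `RunningAccuracy` class is hypothetical, for illustration only; torchmetrics additionally synchronizes state across devices, which naive implementations miss):

```python
# Toy illustration of the stateful update/compute pattern.
class RunningAccuracy:
    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, preds, targets):
        # Accumulate raw counts, so batches of different sizes
        # are weighted correctly in the final result.
        self.correct += sum(p == t for p, t in zip(preds, targets))
        self.total += len(targets)

    def compute(self):
        return self.correct / self.total
```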
### Follow PyTorch Lightning style guide

The style guide is available [here](https://pytorch-lightning.readthedocs.io/en/latest/starter/style_guide.html).

1. Be explicit in your init. Try to define all the relevant defaults, so that the user doesn't have to guess. Provide type hints. This way your module is reusable across projects!

```python
class LitModel(LightningModule):
    def __init__(self, layer_size: int = 256, lr: float = 0.001):
        ...
```
2. Preserve the recommended method order:

```python
class LitModel(LightningModule):
    def __init__():
        ...

    def forward():
        ...

    def training_step():
        ...

    def training_step_end():
        ...

    def on_train_epoch_end():
        ...

    def validation_step():
        ...

    def validation_step_end():
        ...

    def on_validation_epoch_end():
        ...

    def test_step():
        ...

    def test_step_end():
        ...

    def on_test_epoch_end():
        ...

    def configure_optimizers():
        ...

    def any_extra_hook():
        ...
```
### Use Tmux

Tmux is a terminal multiplexer which allows you to run multiple terminal sessions in a single window. It's especially useful when you want to run your training script on a remote server and keep it running even after you close the SSH connection. More about tmux can be found here.

### Specify the GPU device

When running your script on a server with multiple GPUs, you should specify which GPU to use. You can do this by setting the `CUDA_VISIBLE_DEVICES` environment variable:

```bash
CUDA_VISIBLE_DEVICES=0 python train.py
CUDA_VISIBLE_DEVICES=0,1 python train.py
```
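The variable is read at process start, and frameworks index whatever remains from zero, regardless of the physical device IDs. A quick standalone sketch of that re-indexing (illustrative only; `visible_gpus` is a hypothetical helper):

```python
import os

def visible_gpus(env=None):
    """Return the GPU indices a process would see under CUDA_VISIBLE_DEVICES.

    Inside the launched process, CUDA exposes only the listed devices,
    renumbered starting at 0 (e.g. CUDA_VISIBLE_DEVICES=2,3 -> cuda:0, cuda:1).
    """
    value = env if env is not None else os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [d for d in value.split(",") if d]

print(visible_gpus("0,1"))  # ['0', '1'] -> exposed as cuda:0 and cuda:1
```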