Warning

This is a Lab Notebook that describes how to solve a specific problem at a specific time. Please keep this in mind as you read and use the content, and pay close attention to the date, version information, and other details.

Common Machine Learning Tools (TensorFlow, Keras, scikit-learn, PyTorch) on OnDemand

This lab notebook discusses the installation and usage of common machine learning tools (TensorFlow, Keras, scikit-learn, and PyTorch) in Jupyter Notebooks through the HPCC's OnDemand interface. It is mostly oriented towards students in the fall 2024 section of CMSE 492-001 or CMSE 802-001 using the Data Machine, but will hopefully be useful to a general audience.

Usage instructions

We recommend running these tools from a Conda environment inside a Jupyter notebook. To do so, visit ICER's OnDemand portal and click the Jupyter app.

Enter resource request

In the settings, enter the time, cores, and memory you would like.

When using a GPU slice on the Data Machine as recommended below, it is best to use 4 as the "Number of cores per task" and 18GB as the "Amount of memory". This ensures that a single node can be split equally 28 ways. However, if you need more resources, please feel free to ask for them with the understanding that your job may take longer to queue.
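To see what the recommended request adds up to, the short sketch below multiplies it by the 28 jobs that the text says can share a node. It uses only the numbers given above; the function name is illustrative, and the result is the combined request of 28 such jobs, not a statement of the node's actual hardware, which is not given here.

```python
# Values taken from the recommendation above.
CORES_PER_TASK = 4
MEMORY_GB = 18
JOBS_PER_NODE = 28


def combined_request(cores=CORES_PER_TASK, memory_gb=MEMORY_GB, jobs=JOBS_PER_NODE):
    """Return (total cores, total memory in GB) requested by `jobs` identical jobs."""
    return cores * jobs, memory_gb * jobs


total_cores, total_memory_gb = combined_request()
print(f"{JOBS_PER_NODE} jobs request {total_cores} cores and {total_memory_gb} GB in total")
```

Asking for more than 4 cores or 18GB means fewer than 28 such jobs fit on the node at once, which is one reason a larger request may wait longer in the queue.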

Setup Conda environment

Under "Jupyter Location", choose "Conda Environment using Miniforge3 module". You now have two options:

Use a preinstalled Conda environment

Note

This section applies only to students in the fall 2024 section of CMSE 492-001 or CMSE 802-001.

In the "Conda Environment name or path" field, enter the path for your course:

CMSE 492-001 (fall 2024): /mnt/research/CMSE_492_FS24_S001/envs/ml
CMSE 802-001 (fall 2024): /mnt/research/CMSE_802_FS24_S001/envs/ml

If you are not in one of these sections, follow the instructions under "Use your own Conda environment".

Use your own Conda environment

Follow the setup instructions in the appendix below. In the "Conda Environment name or path" field, use ml (or whatever you named your Conda environment).

Run using a GPU on the Data Machine

Note

This section only applies to users who have Data Machine access.

Click the "Advanced Options" checkbox. In the "Number of GPUs" field, enter a100_slice. This reserves a slice of the Data Machine A100 GPUs with 10GB of GPU memory. In the "SLURM Account" field, enter data-machine.

Launch Jupyter

Press the "Launch" button at the bottom and wait for the job to queue and then for Jupyter to start up. This can take a couple of minutes. When Jupyter is ready, click the "Connect to Jupyter" button.
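Once the notebook is open, it can be worth confirming that the frameworks actually see the GPU you requested. The cell below is a minimal sketch, not part of the original instructions: the helper name visible_gpus and its frameworks parameter are illustrative, and it skips any framework that is not installed in the environment.

```python
import importlib.util


def visible_gpus(frameworks=("torch", "tensorflow")):
    """Return the GPUs each installed framework reports, keyed by framework name."""
    report = {}
    if "torch" in frameworks and importlib.util.find_spec("torch") is not None:
        import torch

        # One entry per CUDA device PyTorch can see; empty if none are visible.
        report["torch"] = [
            torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())
        ]
    if "tensorflow" in frameworks and importlib.util.find_spec("tensorflow") is not None:
        import tensorflow as tf

        # Names of the physical GPU devices TensorFlow can see.
        report["tensorflow"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return report
```

Run visible_gpus() in a notebook cell. An empty list for a framework suggests that the job was launched without a GPU or that the framework was installed without CUDA support.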

Appendix: Setup instructions

Use these instructions to set up your own Conda environment. You can make any customizations you like to better fit your workflow.

The first step is to log in and make sure that you are on a development node with a GPU. This tutorial uses dev-amd20-v100 because its GPU is newer and more in line with the GPUs found on the Data Machine. Most importantly, both are compatible with CUDA 12, whereas the k20 and k80 GPUs are not.

# From a gateway node
ssh dev-amd20-v100

Install packages

The next step is to get access to Conda. Using the Miniforge3 module, create a new environment and install the required packages from NVIDIA. Note that even though all of the packages are installed using pip, we still create a Conda environment as this works best with OnDemand.

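# Start from a clean module environment, then load Conda and CUDA
module purge
module load Miniforge3 CUDA
# Create and activate the environment (named ml here)
conda create -n ml
conda activate ml
# Install Python and pip into the environment
conda install python=3.11 pip
# TensorFlow with its CUDA dependencies (quoted so the shell leaves the brackets alone)
python -m pip install "tensorflow[and-cuda]"
# CuPy built for CUDA 12
python -m pip install cupy-cuda12x
python -m pip install scikit-learn
python -m pip install jupyter
# PyTorch wheels built against CUDA 12.1
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
python -m pip install torcheval
# Plotting and progress bars
python -m pip install matplotlib
python -m pip install tqdm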
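After the installs finish, a quick check that every package can be found from the new environment can save a failed OnDemand session later. The sketch below is an illustration, not part of the original setup: the helper missing_packages is a hypothetical name, and the list uses import names, which differ from the pip names for scikit-learn (sklearn) and cupy-cuda12x (cupy). It only looks the packages up; it does not import them or test GPU support.

```python
import importlib.util

# Import names (not pip names) of the packages installed above.
PACKAGES = [
    "tensorflow",
    "cupy",
    "sklearn",
    "torch",
    "torchvision",
    "torchaudio",
    "torcheval",
    "matplotlib",
    "tqdm",
]


def missing_packages(names=PACKAGES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All packages found" if not missing else f"Missing: {missing}")
```

Run it with the ml environment activated, for example by saving it to a file and running it with python. Anything listed as missing can be reinstalled with the matching pip command above.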