Warning
This is a Lab Notebook that describes how to solve a specific problem at a specific time. Please keep this in mind as you read and use the content, and pay close attention to the date, version information, and other details.
cuQuantum Installation and Usage
This lab notebook discusses the installation and usage of the cuQuantum software development kit (SDK), and in particular, the cuTensorNet library. It is mostly oriented towards students in the fall 2024 section of CMSE 890-001 using the Data Machine, but will hopefully be useful to a general audience.
Usage instructions
We recommend using a Conda environment with cuTensorNet installed, accessed through a Jupyter notebook. To do so, visit ICER's OnDemand portal and click the Jupyter app.
Enter resource request
In the settings, enter the time, cores, and memory you would like.
When using a GPU slice on the Data Machine as recommended below, it is best to use 4 as the "Number of cores per task" and 18GB as the "Amount of memory". This ensures that a single node can be split equally 28 ways. However, if you need more resources, please feel free to ask for them with the understanding that your job may take longer to queue.
Setup Conda environment
Under "Jupyter Location", choose "Conda Environment using Miniforge3 module". You now have two options:
Use a preinstalled Conda environment
Note
This section applies only to students in the fall 2024 section of CMSE 890-001.
If you are in the fall 2024 section of CMSE 890-001, in the "Conda Environment name or path" field, use /mnt/research/CMSE890_FS24_S001/envs/cuquantum. Otherwise, follow the Use your own Conda environment instructions.
Use your own Conda environment
Follow the setup instructions below. In the "Conda Environment name or path" field, use cuquantum (or whatever you named your Conda environment).
Run using a GPU on the Data Machine
Note
This section only applies to users who have Data Machine access.
Click the "Advanced Options" checkbox. In the "Number of GPUs" field, enter a100_slice. This reserves a slice of the Data Machine A100 GPUs with 10GB of GPU memory. In the "SLURM Account" field, enter data-machine.
Launch Jupyter
Press the "Launch" button at the bottom and wait for the job to queue and then for Jupyter to start up. This can take a couple of minutes. When Jupyter is ready, click the "Connect to Jupyter" button.
Running a sample program
NVIDIA provides an example of using the Python API for cuTensorNet on their technical blog. You can create a new Jupyter notebook, copy and paste this code into a cell, and run the cell. You should see a FLOP count for the optimized contraction path similar to the one shown in the blog post.
For more examples, see NVIDIA's cuQuantum Python documentation.
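As a rough sketch of the shape of the cuQuantum Python API (this assumes the cuquantum-python package is installed and a GPU is available; `Network` and `contract_path` are part of cuQuantum Python's high-level API, though details may vary between releases):

```python
import numpy as np
from cuquantum import Network  # provided by the cuquantum-python package

# Small random operands for a toy tensor-network contraction
a = np.random.rand(8, 4)
b = np.random.rand(4, 8)

# Build the network, find an optimized contraction path, then contract on the GPU
with Network("ij,jk->ik", a, b) as tn:
    path, info = tn.contract_path()
    print(f"Optimized FLOP count: {info.opt_cost}")
    result = tn.contract()
```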
Compiling a sample CUDA code
The Conda environment also includes libraries to compile C++-based CUDA code for similar types of calculations with much more flexibility (and complexity). To use these libraries, you need the nvcc compiler, which comes with the CUDA module on the HPCC. For more details about compiling CUDA programs, please see our documentation on Compiling for GPUs.
These commands are run from the command line. You can open one in your JupyterLab instance by opening a new tab and clicking "Terminal". Alternatively, you can submit an interactive job from an SSH session if you prefer.
In this example, we'll use the cuTensorNet contraction example provided by NVIDIA, which performs calculations similar to the Python code referenced above. It can be downloaded to the HPCC with a command like
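A sketch of the download, assuming the sample still lives at this path in the NVIDIA/cuQuantum GitHub repository (check the repository if it has moved between releases):

```shell
# Fetch NVIDIA's cuTensorNet contraction sample; the path within the
# NVIDIA/cuQuantum repository may differ between releases
wget https://raw.githubusercontent.com/NVIDIA/cuQuantum/main/samples/cutensornet/tensornet_example.cu
```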
To compile this code, you first need to make sure CUDA is loaded and your conda environment is activated.
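A typical setup sequence looks like the following (the module names and CUDA version are illustrative; check `module spider CUDA` for what is available on the HPCC):

```shell
module purge
module load Miniforge3
module load CUDA/12.3.0   # example version; any CUDA 12 module should work
conda activate cuquantum
```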
To compile the code, you need to point the nvcc compiler to the headers and libraries using the -I and -L flags. These are stored in ${CUQUANTUM_ROOT}/include and ${CUQUANTUM_ROOT}/lib respectively. This example uses the cutensornet and cutensor libraries, brought in using the -l flag. Thus, you can compile the example with the command
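Putting those flags together, the compile command might look like this (assuming the sample file is named `tensornet_example.cu` as in the download step):

```shell
# -I: headers, -L: libraries, -l: link against cutensornet and cutensor
nvcc tensornet_example.cu \
    -I${CUQUANTUM_ROOT}/include \
    -L${CUQUANTUM_ROOT}/lib \
    -lcutensornet -lcutensor \
    -o tensornet_example
```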
You can then run the executable with
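Assuming the compile step above produced an executable named `tensornet_example`:

```shell
./tensornet_example
```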
Make sure CUDA is loaded and libraries are on your LD_LIBRARY_PATH
To run your code, it is important that the executable can find the libraries it needs at runtime. Make sure to always run your code after loading the CUDA module and activating your Conda environment with
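For example (the CUDA version is illustrative; use the same module you compiled with):

```shell
module purge
module load Miniforge3
module load CUDA/12.3.0
conda activate cuquantum
```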
In particular, the libraries you used to compile need to be added to the LD_LIBRARY_PATH environment variable, which is done by setting the extra environment variables in your Conda environment setup. Optionally, if you skipped this step, you can also set your LD_LIBRARY_PATH manually with
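One possible command, assuming CUQUANTUM_ROOT points at your Conda environment as described in the setup appendix:

```shell
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${LD_LIBRARY_PATH}
```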
You will need to do this every time you start a new shell before running your code.
Appendix: Further reading
After experimenting with the samples above, you may be interested in the following:
- Requesting multiple GPUs and chaining them together with MPI. Note that you can request multiple Data Machine GPU slices using a100_slice:n where n is the number of slices you would like to use.
- Using whole GPUs in the Data Machine instead of slices (subject to availability).
- Exploring the different types of Python bindings provided by the cuQuantum Python package.
- Compile and run more cuTensorNet examples
- Run more cuTensorNet Python examples
Appendix: Setup instructions
Use these instructions to set up your own Conda environment. You can make any customizations you like to better fit your workflow.
Login
The first step is to log in and make sure that you are on a development node with a GPU. This tutorial uses dev-amd20-v100 because it has a newer GPU, more in line with the GPUs found on the Data Machine. Most importantly, both are compatible with CUDA 12, whereas the k20 and k80 GPUs are not.
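A typical login sequence looks like the following (replace `<netid>` with your MSU NetID; the gateway hostname is per ICER's documentation):

```shell
ssh <netid>@hpcc.msu.edu
ssh dev-amd20-v100
```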
Install packages
The next step is to get access to Conda. Using the Miniforge3 module, create a new environment and install the required packages from NVIDIA.
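A sketch of the environment creation (the package name is the one published on conda-forge at the time of writing; the Python version is an example):

```shell
module purge
module load Miniforge3
conda create --name cuquantum python=3.11 -y
conda activate cuquantum
# cuquantum-python pulls in the cuQuantum SDK libraries (cutensornet, cutensor, ...)
conda install -c conda-forge cuquantum-python -y
```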
Set some extra environment variables
To make the most use of cuQuantum for CUDA compilation and native MPI support in cuTensorNet, you can set a few variables when you activate your Conda environment.
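One way to do this is with a Conda activation hook, a script that runs each time the environment is activated. A sketch, assuming the cuQuantum libraries live inside the Conda prefix:

```shell
# Create an activation hook that runs on every `conda activate cuquantum`
mkdir -p ${CONDA_PREFIX}/etc/conda/activate.d
cat > ${CONDA_PREFIX}/etc/conda/activate.d/cuquantum.sh << 'EOF'
# Root of the cuQuantum installation inside this Conda environment
export CUQUANTUM_ROOT=${CONDA_PREFIX}
export CUTENSOR_ROOT=${CONDA_PREFIX}
# Make the cutensornet/cutensor shared libraries findable at runtime
export LD_LIBRARY_PATH=${CUQUANTUM_ROOT}/lib:${LD_LIBRARY_PATH}
# Optional: point cuTensorNet at its MPI plugin for native MPI support
export CUTENSORNET_COMM_LIB=${CUQUANTUM_ROOT}/lib/libcutensornet_distributed_interface_mpi.so
EOF
```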