Using the Data Machine
This page acts as a reference for using some of the features of the Data Machine. For more general information on what the Data Machine can offer, please see the Data Machine overview.
Table of Data Machine resources
Node | CPUs | Memory | Local NVMe storage | GPU | GPU memory |
---|---|---|---|---|---|
acm-048, acm-049, acm-070, acm-071 | 128 | 2 TB | 32 TB | none | n/a |
nal-004, nal-005 | 128 | 512 GB | 32 TB | 4 NVIDIA A100 GPUs each | 80 GB (per GPU) |
nal-006, nal-007 | 128 | 512 GB | 32 TB | 4 NVIDIA A100 GPUs each, split into 7 allocatable units | 10 GB (per unit) |
Two of the GPU nodes (nal-004 and nal-005) have four full A100 GPUs each, while the other two (nal-006 and nal-007) have four A100 GPUs each split into seven allocatable units. Each split unit has 10 GB of GPU memory. These units are requested in the same way as normal GPUs; see the examples below.
Acknowledging Data Machine usage
If you use the Data Machine in your work, please use our acknowledgement.
Running code on the Data Machine
Though the Data Machine is not a buy-in node, the same procedures are used behind the scenes to run on Data Machine nodes. Therefore, users must be added to the data-machine buy-in account to run jobs on the Data Machine. To be added to this account, please submit a request.
Note
The data-machine account is limited to Data Machine nodes only.
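Once your request has been processed, you can check that the account is available to you. A minimal check, assuming the standard Slurm accounting tools are available on the login nodes, is:
# List the accounts your user can submit jobs under;
# data-machine should appear once you have been added.
sacctmgr show associations user=$USER format=account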
OnDemand Data Machine access
Each OnDemand app has an "Advanced Options" checkbox. This opens additional form entries. To use the Data Machine nodes, enter data-machine in the SLURM Account text box. Your job will queue onto a Data Machine node. Other resources (time, CPUs, and memory) can be requested as usual by filling out the "Number of hours", "Number of cores per task", and "Amount of memory" boxes.
GPU access
To use a single GPU unit with your OnDemand session, open the "Advanced Options" section and enter a100_slice under "Number of GPUs". If you would like multiple units, use a100_slice:n, where n is the number of units you would like on a single node. To request full GPUs, use a100 instead of a100_slice.
SLURM scripting Data Machine access
Below are some examples of SLURM resource requests for the Data Machine.
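Each example is an ordinary batch script: add the commands you want to run after the #SBATCH lines, save the script to a file, and submit it as usual, e.g.
sbatch my_data_machine_job.sh
(the file name here is just a placeholder).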
Partial Data Machine node
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=32 # Set to your desired number of CPUs
Full Data Machine node with large memory and no GPU
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=2TB # Uses all memory on a large memory node
#SBATCH --cpus-per-task=128 # Uses all CPUs on a node
One GPU unit on a single node
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=32 # Set to your desired number of CPUs
#SBATCH --gpus=a100_slice # Request one GPU unit on the reserved node
Two GPU units on a single node
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=32 # Set to your desired number of CPUs
#SBATCH --gpus=a100_slice:2 # Request two GPU units on the reserved node
One full GPU on a single node
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=32 # Set to your desired number of CPUs
#SBATCH --gpus=a100 # Request one GPU on the reserved node
Two full GPUs on a single node
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=32 # Set to your desired number of CPUs
#SBATCH --gpus=a100:2 # Request two GPUs on the reserved node
Using the fast NVMe storage
You can preload your data into local NVMe storage using "burst buffers". SLURM will move the data you want to use into NVMe storage before your job starts.
Requesting a node
At the moment, burst buffers work best when requesting one specific node in the data machine. This ensures that the time SLURM takes to move your data does not count against the time you reserve the node for.
However, be careful which node you pick. If this node is busy, SLURM will wait until it is available to assign it to you. You can use the buyin_status --account data-machine command to see the current usage of the Data Machine nodes.
Example burst buffer resource specification
In this example, we'll assume that we don't need a GPU and choose acm-048.
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodelist=acm-048 # Restrict to a specific data machine node
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=128 # Set to your desired number of CPUs
#BB source=/mnt/home/<username>/important/data/here
Using the local data
SLURM sets an environment variable BB_DATA with the location of your data on the local NVMe storage. Use this directory to access your data with lower latency than the home, research, or scratch space it originally came from.
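For example, the body of a burst buffer job script (after the #SBATCH and #BB lines) might use the staged copy like this; my_analysis, input.dat, and results.out are placeholder names for your own program and files:
# Work from the staged copy on the local NVMe storage
echo "Data staged to: $BB_DATA"
ls -lh "$BB_DATA" # confirm that your files were staged in
my_analysis --input "$BB_DATA/input.dat" --output "$BB_DATA/results.out"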
Saving data written to local storage
Usually, if you edit data on local storage, your changes will be lost after the job ends. However, if you add the specification resync=true to the #BB line in your submission script, that data will be copied back to the location it was originally taken from after the job ends.
Example:
#!/bin/bash
#SBATCH --account=data-machine # Run under the data machine buy-in
#SBATCH --nodelist=acm-048 # Restrict to a specific data machine node
#SBATCH --nodes=1 # Reserve only one node
#SBATCH --time=4:00:00 # Reserve for four hours (or your desired amount of time)
#SBATCH --mem=256GB # Set to your desired amount of memory
#SBATCH --cpus-per-task=128 # Set to your desired number of CPUs
#BB source=/mnt/home/<username>/important/data/here resync=true
Debugging burst buffer issues
To check on the status of your submitted jobs, use the command
squeue --me
The NODELIST(REASON) column of the output may give information relevant to burst buffer steps, e.g.,
JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON)
24411390 data-mach data_mac grosscra PENDING 0:00 1:00:00 1 (BurstBufferResources)
24411197 data-mach data_mac grosscra PENDING 0:00 1:00:00 1 (burst_buffer/lua: slurm_bb_data_in: )
A job with the BurstBufferResources reason is waiting for a node on which to run and to begin transferring data. In the example above, the second job is running slurm_bb_data_in, i.e., it is transferring its data to the node.
For more information or if there are problems with the burst buffer specification, use the command
scontrol show job <jobid>
For example, when scontrol show job 24411197 was run while the job above was transferring data, the output ended with
...
BurstBuffer=#BB source=/mnt/home/grosscra/scripts
BurstBufferState=staging-in
...
Often the Comment field in the scontrol show job <jobid> output can give helpful burst buffer information.