Using the Data Machine
This page acts as a reference for using some of the features of the Data Machine. For more general information on what the Data Machine can offer, please see the Data Machine overview.
Table of Data Machine resources
| Node | CPUs | Memory | Local NVME storage | GPU | GPU memory |
|---|---|---|---|---|---|
| acm-048, acm-049, acm-070, acm-071 | 128 | 2 TB | 32 TB | | |
| nal-004, nal-005 | 128 | 512 GB | 32 TB | 4 NVIDIA A100 GPUs each | 80 GB (per GPU) |
| nal-006, nal-007 | 128 | 512 GB | 32 TB | 4 NVIDIA A100 GPUs each, split into 7 allocatable units | 10 GB (per unit) |
Two of the GPU nodes (nal-004 and nal-005) have four full A100 GPUs each, while the other two (nal-006 and nal-007) have four GPUs each split into seven allocatable units. Each split unit has 10 GB of GPU memory. These units are requested in the same way as normal GPUs; see the examples below.
Running code on the Data Machine
Though the Data Machine is not a buy-in node, the same procedures are used behind the scenes to run jobs on Data Machine nodes. Users must therefore be added to the `data-machine` buy-in account to run jobs on the Data Machine. To be added to this account, please submit a request.
Note: the `data-machine` account is limited to Data Machine nodes only.
OnDemand Data Machine access
Each OnDemand app has an "Advanced Options" checkbox, which reveals additional form entries. To use the Data Machine nodes, enter `data-machine` in the "SLURM Account" text box; your job will then queue onto a Data Machine node. Other resources (time, CPUs, and memory) can be requested as usual by filling out the "Number of hours", "Number of cores per task", and "Amount of memory" boxes.
GPU access
To use a single GPU unit with your OnDemand session, open the "Advanced Options" section and enter `a100_slice` under "Number of GPUs". If you would like multiple units, use `a100_slice:n`, where `n` is the number of units you would like on a single node. To request full GPUs, use `a100` instead of `a100_slice`.
SLURM scripting Data Machine access
Below are some examples of SLURM resource requests for the Data Machine.
Partial Data Machine node
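A minimal sketch of such a request; the walltime, CPU, memory, and program values are placeholders to adapt to your workload:

```bash
#!/bin/bash
#SBATCH --account=data-machine   # required to land on Data Machine nodes
#SBATCH --time=01:00:00          # placeholder walltime
#SBATCH --cpus-per-task=16       # a subset of a node's 128 CPUs
#SBATCH --mem=64G                # placeholder memory request

srun ./my_program
```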
Full Data Machine node with large memory and no GPU
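A sketch assuming one of the acm nodes (which have 2 TB of memory); the walltime is again a placeholder:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --time=01:00:00          # placeholder walltime
#SBATCH --cpus-per-task=128      # all CPUs on the node
#SBATCH --mem=2000G              # most of an acm node's 2 TB of memory

srun ./my_program
```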
One GPU unit on a single node
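A sketch assuming the split units are exposed through SLURM's typed-GRES syntax under the `a100_slice` name used above; the other values are placeholders:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --cpus-per-task=8            # placeholder CPU request
#SBATCH --mem=32G                    # placeholder memory request
#SBATCH --gres=gpu:a100_slice:1      # one 10 GB A100 unit

srun ./my_program
```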
Two GPU units on a single node
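The same sketch with the unit count raised to two:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --cpus-per-task=8            # placeholder CPU request
#SBATCH --mem=32G                    # placeholder memory request
#SBATCH --gres=gpu:a100_slice:2      # two 10 GB A100 units on one node

srun ./my_program
```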
One full GPU on a single node
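A sketch using the `a100` name for a full GPU, again assuming typed-GRES syntax:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --cpus-per-task=8            # placeholder CPU request
#SBATCH --mem=64G                    # placeholder memory request
#SBATCH --gres=gpu:a100:1            # one full 80 GB A100

srun ./my_program
```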
Two full GPUs on a single node
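And with two full GPUs on a single node:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --cpus-per-task=16           # placeholder CPU request
#SBATCH --mem=128G                   # placeholder memory request
#SBATCH --gres=gpu:a100:2            # two full 80 GB A100s on one node

srun ./my_program
```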
Using the fast NVME storage
You can preload your data into local NVME storage using "burst buffers". SLURM will move the data you want to use into NVME storage before your job starts.
Requesting a node
At the moment, burst buffers work best when requesting one specific node in the Data Machine. This ensures that the time SLURM takes to move your data does not count against the time you reserve the node for.
However, be careful which node you pick: if that node is busy, SLURM will wait until it is available before assigning it to you. It is therefore worth checking the current usage of the Data Machine nodes first.
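One way to do this is `squeue`'s node-list filter (a sketch listing all eight Data Machine nodes):

```bash
squeue -w acm-048,acm-049,acm-070,acm-071,nal-004,nal-005,nal-006,nal-007
```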
Example burst buffer resource specification
In this example, we'll assume we don't need a GPU and choose `acm-048`.
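A sketch of such a submission script; the `source=` specification on the `#BB` line is a hypothetical syntax for naming the data to stage in (check the burst buffer documentation for the exact form), and the resource values are placeholders:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --nodelist=acm-048           # pin the job to this specific node
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --cpus-per-task=16           # placeholder CPU request
#SBATCH --mem=64G                    # placeholder memory request
#BB source=/home/<username>/mydata   # hypothetical: data to stage into local NVME

srun ./my_program --input "$BB_DATA"
```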
Using the local data
SLURM sets an environment variable `BB_DATA` with the location of your data on the local NVME storage. Use this directory to access your data with less latency than the home, research, or scratch space where it originally came from.
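For example, inside the job script (the program and file names here are placeholders):

```bash
ls "$BB_DATA"                               # inspect the staged copy on local NVME
./my_program --input "$BB_DATA/input.dat"   # read from the staged copy, not the original path
```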
Saving data written to local storage
Usually, if you edit data on local storage, your changes will be lost after the job ends. However, if you add the specification `resync=true` to the `#BB` line in your submission script, that data will be copied back to the location it was originally taken from after the job ends.
Example:
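A sketch, reusing the hypothetical `source=` syntax from the previous example:

```bash
#!/bin/bash
#SBATCH --account=data-machine
#SBATCH --nodelist=acm-048                       # pin the job to this specific node
#SBATCH --time=01:00:00                          # placeholder walltime
#SBATCH --cpus-per-task=16                       # placeholder CPU request
#SBATCH --mem=64G                                # placeholder memory request
#BB source=/home/<username>/mydata resync=true   # copy changes back when the job ends

srun ./my_program --input "$BB_DATA"
```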
Debugging burst buffer issues
To check on the status of your submitted jobs, use the `squeue` command.
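For example, to list only your own jobs:

```bash
squeue -u $USER
```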
The `NODELIST(REASON)` column of the output may give information relevant to burst buffer steps, e.g.,
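The exact reason text depends on the SLURM version and burst buffer plugin; illustrative output for two such jobs might look like (job IDs and names are placeholders):

```
   JOBID   NAME   USER  ST  TIME  NODES  NODELIST(REASON)
24411196 my_job   user  PD  0:00      1  (BurstBufferResources)
24411197 my_job   user  PD  0:00      1  (burst_buffer/lua: slurm_bb_data_in)
```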
A job with the `BurstBufferResources` reason is waiting for a node to run on and begin transferring resources. In the example above, one job is running `slurm_bb_data_in`, i.e., it is transferring the data to the node.
For more information, or if there are problems with the burst buffer specification, use the command
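```bash
scontrol show job <jobid>
```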
For example, running `scontrol show job 24411197` while the job above was transferring data showed burst buffer details at the end of its output.
Often, the `Comment` field in the `scontrol show job <jobid>` output can give helpful burst buffer information.