The Data Machine

ICER is excited to offer a new computational resource called the “Data Machine”. Many research areas now face large volumes of data, driven in large part by the growth of available datasets. Manipulating, analyzing, and visualizing this much data requires specialized computational resources that traditional high performance computing systems do not typically offer. Additionally, this “data explosion” is occurring primarily in fields where research computing has historically not been widely used.

To meet these hardware and educational needs, ICER is developing the Data Machine and associated outreach and training programs. Though the machine is not yet available to the broader research community (both at MSU and beyond), ICER is looking to connect with researchers who are interested in this new initiative and willing to experiment with the machine and offer feedback. ICER would also like to develop relationships with instructors hoping to integrate data-intensive computation (including demonstrations, assignments, and projects) into their classrooms.

Researchers and instructors looking for more information are encouraged to read below and submit short proposals. For the time being, access to the Data Machine is only provided upon request.

Would you benefit from the Data Machine?

The Data Machine is structured to benefit users whose research and/or instruction is described by many or all of the following characteristics:

Datasets that are many GB in size or larger
A combination of datasets of varying types and provenances
Datasets that require many read & write (I/O) operations, including those composed of many small files
Desire for interactive data analysis or workflow development
Desire to incorporate data from publicly available repositories or Cloud-based networks
Desire to share data products with researchers at other institutions
Use of machine learning and artificial intelligence techniques

In particular, the Data Machine is intended to be accessible to researchers (particularly students) with little background in programming or high performance computing.

Example Use Cases

Below are some example use cases for the new Data Machine drawn from early users. We hope these examples will illuminate how the Data Machine might benefit your research.

Genomics of Microbial Communities

A DNA sequencing machine can produce millions of short raw sequences that need to be combined into larger assemblies and genomes in order to be useful. This requires accessing many small files and holding lots of data in memory. These assemblies are then compared to data available through public repositories and inform ecological and evolutionary modeling.

Agent-Based Modeling

Agent-based models (ABMs) simulate thousands of individual actors to uncover large-scale behavioral patterns. Each of these agents requires information from a large combination of datasets, ranging from archival maps to near real-time GPS and social media data. Aggregating these datasets for use in ABMs benefits from interactive development and testing, which requires the data to be available in memory. Production runs of ABMs produce many terabytes of data that similarly benefit from interactive workflow development, as well as from high-end GPUs to accelerate machine learning and visualization.

Spatial and Community Ecology

Data from a breadth of biological, ecological, and earth science disciplines can be combined to provide insights on large-scale ecological patterns and their drivers. This requires large-scale statistical analysis of combined datasets. These datasets and analysis tools can then be made available to external collaborators. Additionally, these data can be used to parameterize mathematical models that simulate ecological interactions over long periods of time.

Data-Driven Turbulence Modeling

Simulations capturing the behavior of turbulent flows at both small and large scales are computationally expensive, taking months to run and producing many terabytes of data. Machine learning algorithms can be trained to emulate the small-scale behavior of turbulence for more efficient modeling of large-scale systems. These models must be trained on large quantities of high-fidelity simulation data, requiring high-end GPUs.

Design of the Data Machine

The characteristics outlined above can be translated into hardware requirements, particularly large amounts of memory per core, ample low-latency data storage, GPUs, and networks capable of transferring large volumes of data. The Data Machine will have a total of 8 nodes: 4 focused on CPU-intensive jobs and 4 for jobs requiring GPUs.

The Data Machine will be connected to the existing HPCC file systems and compute resources. Though the CPU and GPU hardware is similar to what is offered by the current amd21 and amd22 clusters, what sets the Data Machine apart is the amount of memory available to CPU cores, the amount of local data storage available to the node, and the way users will interface with the machine.

CPU Nodes

Each of the 4 CPU-focused nodes will have the following:

128 CPU cores
2 TB of memory
32 TB of local high speed SSD storage

GPU Nodes

Each of the 4 GPU nodes will have the following:

128 CPU cores
512 GB of memory
32 TB of local high speed SSD storage
4 NVIDIA A100 GPUs with 80 GB of memory per GPU

The A100 GPUs combine Tensor Cores with support for mixed-precision arithmetic, making them well suited to machine learning and artificial intelligence applications. A job may utilize multiple GPUs per node, or the GPUs may be partitioned for use in interactive data exploration or for course work.
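The node specifications listed above can be turned into per-core and whole-machine figures with a little arithmetic. The following is a minimal sketch, not an official ICER tool: the dictionary layout and function names are hypothetical, and it assumes 1 TB = 1024 GB, a convention the specifications do not state.

```python
# Node specifications copied from the lists above.
# Assumption: 1 TB = 1024 GB; the source does not say which convention applies.
GB_PER_TB = 1024

NODE_TYPES = {
    "cpu": {"count": 4, "cores": 128, "memory_gb": 2 * GB_PER_TB, "ssd_tb": 32, "gpus": 0},
    "gpu": {"count": 4, "cores": 128, "memory_gb": 512, "ssd_tb": 32, "gpus": 4},
}
GPU_MEMORY_GB = 80  # memory per A100, as listed above


def memory_per_core_gb(node):
    """Host memory divided evenly across the CPU cores of one node."""
    return node["memory_gb"] / node["cores"]


def machine_totals(node_types):
    """Sum cores, host memory, local SSD, and GPUs over all nodes."""
    totals = {"nodes": 0, "cores": 0, "memory_gb": 0, "ssd_tb": 0, "gpus": 0}
    for node in node_types.values():
        totals["nodes"] += node["count"]
        for key in ("cores", "memory_gb", "ssd_tb", "gpus"):
            totals[key] += node["count"] * node[key]
    totals["gpu_memory_gb"] = totals["gpus"] * GPU_MEMORY_GB
    return totals


if __name__ == "__main__":
    for name, node in NODE_TYPES.items():
        print(f"{name} node: {memory_per_core_gb(node):.0f} GB of memory per core")
    print(machine_totals(NODE_TYPES))
```

Under these assumptions, a CPU node works out to 16 GB of memory per core and a GPU node to 4 GB per core, which may help when judging which node type suits a memory-bound workload.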

The Advantage of Solid State Drives

A unique feature of the Data Machine is that each node (CPU or GPU) is directly connected to 32 TB of solid state drive (SSD) storage. This local storage allows for more efficient read and write (I/O) access than the existing disk-based file systems used by the HPCC.
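Researchers whose datasets consist of many small files may want to gauge how a given storage location handles that access pattern. The sketch below is one rough way to do so, using only the Python standard library; the function name and default file counts are hypothetical, and it is not an ICER-provided benchmark. It writes a batch of small files into a directory, reads them back, and reports the elapsed time for each phase.

```python
import os
import tempfile
import time


def small_file_roundtrip(directory, n_files=200, size_bytes=4096):
    """Write n_files small files into directory, read them back, and time both phases."""
    payload = os.urandom(size_bytes)
    paths = [os.path.join(directory, f"part_{i:05d}.bin") for i in range(n_files)]

    start = time.perf_counter()
    for path in paths:
        with open(path, "wb") as fh:
            fh.write(payload)
    write_s = time.perf_counter() - start

    # Note: files read immediately after writing are usually served from the
    # operating system's page cache, so read timings here are optimistic.
    start = time.perf_counter()
    bytes_read = 0
    for path in paths:
        with open(path, "rb") as fh:
            bytes_read += len(fh.read())
    read_s = time.perf_counter() - start

    return {"write_s": write_s, "read_s": read_s, "bytes_read": bytes_read}


if __name__ == "__main__":
    # A temporary directory stands in for the storage location being tested.
    with tempfile.TemporaryDirectory() as scratch:
        print(small_file_roundtrip(scratch))
```

Running the same function against a directory on node-local storage and one on a shared file system would give a rough comparison; the actual mount points on the Data Machine are not specified here and would need to be confirmed with ICER.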

These SSDs are accessed following the NVMe specification, which allows for many possible data access options. Some of these options include direct filesystem access from the GPUs (bypassing the CPU) or using a portion of the SSD as virtual memory to allow access to datasets larger than the memory limits of the node. Users interested in these alternative configurations are encouraged to contact ICER.
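As a loosely related illustration of working with data that need not fit in memory all at once, the sketch below uses ordinary operating-system memory mapping from the Python standard library. This is not the NVMe-backed virtual memory or direct GPU access described above, which would be configured with ICER; it only shows the general idea of reading a small piece of a file on disk without loading the whole file. The record format and function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian 64-bit float per record.
RECORD = struct.Struct("<d")


def write_records(path, n_records):
    """Write n_records floats (0.0, 1.0, 2.0, ...) to a binary file."""
    with open(path, "wb") as fh:
        for i in range(n_records):
            fh.write(RECORD.pack(float(i)))


def read_record(path, index):
    """Read a single record through a read-only memory map.

    Only the pages that are actually touched get loaded from storage,
    so the file can be much larger than available memory.
    """
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            (value,) = RECORD.unpack_from(view, index * RECORD.size)
            return value


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        data_path = os.path.join(scratch, "records.bin")
        write_records(data_path, 1000)
        print(read_record(data_path, 750))
```

Memory-mapped access of this kind tends to benefit from low-latency storage, since each untouched page is fetched on first use.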

Running on the Data Machine

Though the Data Machine is not yet ready for general access, users can anticipate the following workflow features. Please note that the Data Machine will be separate from ICER’s buy-in program and is currently only accessible upon request.

Interactive Jobs

The structure of ICER’s HPCC is oriented towards submitting batch jobs to a system queue, which has been the traditional access pattern for high performance computing (HPC) resources. The Data Machine will instead prioritize interactive usage through OnDemand. Via OnDemand, users will have access to tools such as RStudio, Jupyter notebooks, MATLAB, and Stata. Other tools can be added to OnDemand upon request; to request one, please submit a ticket. When the Data Machine is experiencing high demand, interactive jobs will be able to preempt lower-priority workloads.

Containerization

Research groups that can make efficient use of the Data Machine may have complex and specialized software needs. These include software with large sets of dependencies and software that expects a particular runtime environment. The difficulties associated with building and deploying such software can be greatly alleviated by containers. Containers bundle together the user’s software and its minimum set of dependencies into a single executable that will behave consistently on any system.

ICER will develop containers for the most common anticipated use cases (e.g. ML/AI with Python, bioinformatics) and make these container images available to users. ICER will also work with users looking to design images for their own research groups, courses, etc. Users interested in developing containers to match their needs should contact ICER.

The Data Machine will be connected to ICER’s existing infrastructure, including the HPCC file systems and MSU's High-Speed Research Network. The latter will allow researchers to share their datasets and other products with collaborators outside of MSU via Globus.

Researchers will also be able to move data into and out of the Data Machine via cloud systems. High speed access to the main cloud providers (AWS, Google Cloud, and Microsoft Azure) is a priority for the Data Machine so that researchers can leverage datasets regardless of where they are stored.

Educational Support

ICER will be offering dedicated training for the Data Machine in addition to its traditional offerings. These trainings will be available both synchronously and asynchronously via workshops, web-based tutorials, and self-paced training modules. All new users will be required to take an orientation workshop prior to being granted access to the Data Machine. This will ensure users have the knowledge and skills to make efficient use of the machine.

Users of the Data Machine will be able to access ICER support staff through the existing ticket system and weekly office hours on Microsoft Teams. Researchers requiring additional assistance with, for example, particularly complex research needs or workflow development are encouraged to leverage the ARCS program.

Support will also be offered to instructors looking to integrate computational exercises into their courses. Instructors interested in initiating such projects should submit an abstract to this form.

Data Machine User Advisory Board

Once the Data Machine becomes widely available, ICER will be looking for graduate students, postdocs, and faculty who use the Data Machine to join the User Advisory Board (UAB). This UAB will meet regularly to discuss the current state of the machine and user experiences, and to make recommendations about software and policy changes. The UAB will also make recommendations as to user training and support for the Data Machine.

A typical UAB term is two years, and researchers may serve consecutive terms. Researchers interested in serving on the UAB should contact ICER.

Request Access to the Data Machine

Currently, the Data Machine is only available upon request. Users interested in using the Data Machine will need to submit short proposals for research and instructional support using this form with the subject line "Data Machine". Note that usage of the Data Machine is subject to providing a short annual report detailing accomplishments using the machine.

Instructional use of the Data Machine should be requested by the course's lead instructor. Research use should be requested by the Principal Investigator (PI).

For both research and instructional use, the requester must supply the following information:

The NetID of the user(s) they are making the request for
A short abstract describing the research or course work that will be done on the Data Machine
A brief justification for why the resources of the Data Machine will benefit this work
Acknowledgement that the requester agrees to provide a short annual report detailing accomplishments using the machine (if requested by ICER)