Compute Options

Which HPC Compute option should I choose?

The most important driver in choosing an HPC compute option comes down to this: What do you intend to use it for? Here are some typical HPC use cases to consider:

Batch/Pipeline

Batch computing and data pipelines are strategies for efficiently executing large numbers of jobs. Batch computing jobs run with minimal or no user interaction and are often scheduled during low-use hours or as resources become available. Data pipelines break jobs with sequential steps into distinct elements and execute them in parallel, creating an assembly line that compiles these elements into completed jobs.

  • Systems Biology - The HPC Liaison can help you optimize your code for batch computing and pipelines from the beginning, increasing efficiency and eliminating bugs before they impact your progress.
  • AWS - Using AWS’s popular Elastic Compute Cloud (EC2) service, you can create an almost unlimited number of virtual machines simultaneously, reducing the time needed to complete your tasks. Additional toolkits like StarCluster and AWS Batch can make provisioning resources easier and help manage costs (a job-submission sketch follows this list).
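
As one illustration of the AWS side, here is a minimal sketch in Python using boto3 that submits a single job to AWS Batch. It assumes you have already configured a compute environment, job queue, and job definition; the queue, job definition, and command names below are hypothetical placeholders.

    import boto3

    # AWS Batch client; credentials come from your environment or AWS config.
    batch = boto3.client("batch", region_name="us-east-1")

    # Submit one job to an existing queue. The names below are placeholders
    # for resources you would create ahead of time.
    response = batch.submit_job(
        jobName="example-analysis-job",
        jobQueue="my-job-queue",            # hypothetical queue name
        jobDefinition="my-job-definition",  # hypothetical job definition
        containerOverrides={
            "command": ["python", "run_analysis.py", "--input", "chunk-01"],
        },
    )
    print("Submitted job:", response["jobId"])

In a pipeline, you would typically submit one such job per input chunk, or chain sequential stages with the dependsOn parameter so each stage starts only after the previous one finishes.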

Genomic Sequencing

Genomic sequencing requires vast computational resources for effective analysis of genome sequences because of the large amounts of data involved. By using HPC, researchers can eliminate data bottlenecks and store data more efficiently and affordably.

  • Systems Biology - The powerful resources available through the Systems Biology HPC cluster can greatly reduce the time involved in genomic research, and personalized user support can help ensure that jobs run at peak efficiency.
  • AWS - EC2 supports a variety of platforms for uploading, sequencing, and storing large-scale data, with the flexibility and tools required for managing custom workflows and creating efficient, reusable pipelines (an upload sketch follows this list).
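
For large sequencing files, a common first step on AWS is moving the data into Amazon Simple Storage Service (S3). Below is a minimal sketch with Python and boto3, assuming an existing bucket; the bucket, key, and file names are hypothetical placeholders. The upload_file call performs multipart uploads automatically for large objects.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")

    # Tune multipart behavior for large genomic files: split the upload
    # into 64 MB parts and run up to 8 part uploads in parallel.
    config = TransferConfig(
        multipart_threshold=64 * 1024 * 1024,
        multipart_chunksize=64 * 1024 * 1024,
        max_concurrency=8,
    )

    # Bucket and key names are hypothetical placeholders.
    s3.upload_file(
        "sample_reads.fastq.gz",
        "my-genomics-bucket",
        "runs/2024-01/sample_reads.fastq.gz",
        Config=config,
    )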

Visualization

Visualization is an essential tool for presenting information gleaned from extremely large data sets in a clear and meaningful way. By creating graphics, maps, and other helpful representations of data and analysis results, visualizations can help convey scope, context, and quantitative messaging. HPC can accelerate this process with specialized hardware like graphics processing units (GPUs).

  • Systems Biology - The HPC cluster offers access to 148 NVIDIA GPUs with 75,776 total CUDA cores, providing significant resources geared toward data visualization, along with user support from Systems Biology and MSPH staff (a headless plotting sketch follows this list).
  • AWS - NICE DCV is a free-to-use remote display protocol that lets users run graphics-intensive applications on EC2 instances and stream the results without the need for dedicated workstations. AWS also offers services like QuickSight, which can connect to supported data sources like Amazon Simple Storage Service (S3) to create visualizations.
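
Cluster nodes typically have no display attached, so plots are usually rendered directly to image files. Here is a minimal sketch using Python and matplotlib's non-interactive Agg backend; the data is randomly generated purely for illustration.

    import matplotlib
    matplotlib.use("Agg")  # render without a display, as on a cluster node
    import matplotlib.pyplot as plt
    import numpy as np

    # Stand-in for real analysis results: one million random points.
    rng = np.random.default_rng(seed=0)
    x = rng.normal(size=1_000_000)
    y = x + rng.normal(scale=0.5, size=1_000_000)

    # A 2D histogram summarizes large point clouds better than a scatter plot.
    fig, ax = plt.subplots(figsize=(6, 5))
    h = ax.hist2d(x, y, bins=200)
    fig.colorbar(h[3], ax=ax, label="point count")
    ax.set_xlabel("variable 1")
    ax.set_ylabel("variable 2")
    fig.savefig("density.png", dpi=150)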

Collaboration

MSPH researchers often collaborate with other institutions, providing shared access to data or even compute environments to facilitate faster results, replication, and feedback. Users must determine what level of access to provide, what resources are needed, and how quickly data will need to be shared.

  • Systems Biology - MSPH can provide approved collaborators with their own user accounts, offering direct access to the cluster and storage services at Systems Biology. Users can share local data with outside collaborators using high-speed transfer services like Globus, or, when time is less critical, send data via courier.
  • AWS - Services like EC2 allow collaborators to share data, machine images, or workflows. Because the shared data already lives in the cloud, it is equally available to everyone. AWS users must coordinate which services and configurations are used, and determine how to account for shared responsibilities and variable costs with collaborators (a data-sharing sketch follows this list).
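
One simple sharing pattern on AWS, offered here as an illustration rather than the only option, is an S3 presigned URL: a time-limited link that lets a collaborator download an object without an AWS account. A minimal sketch with boto3 follows; the bucket and key names are hypothetical.

    import boto3

    s3 = boto3.client("s3")

    # Generate a link that lets anyone holding it download the object
    # for the next 24 hours. Bucket and key are hypothetical placeholders.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-shared-bucket", "Key": "results/summary.csv"},
        ExpiresIn=24 * 3600,
    )
    print(url)  # send this link to the collaborator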

Program-Specific

For some programming languages or interfaces, the availability and scalability of resources can determine which HPC option is right for you. 

  • Systems Biology - Most commonly used programming languages are available and pre-installed on the Systems Biology cluster. The HPC Liaison can review your code and determine whether your requested resources are appropriate.
  • AWS - Some workloads require the flexibility to quickly change or recreate multiple instances with little notice or downtime. For example, when using RStudio, you can easily create multiple instances on demand with AWS EC2 (a provisioning sketch follows this list).
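
Here is a minimal sketch of on-demand provisioning with Python and boto3: launching several identical EC2 instances from a prebuilt machine image, such as one with RStudio Server installed. The AMI ID, instance type, and key pair below are hypothetical placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Launch three identical instances from a prebuilt image.
    # All identifiers below are hypothetical placeholders.
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # e.g., an image with RStudio Server
        InstanceType="t3.large",
        MinCount=3,
        MaxCount=3,
        KeyName="my-key-pair",
    )
    for instance in response["Instances"]:
        print("Launched:", instance["InstanceId"])

When the work is finished, terminate_instances shuts the instances down so you stop paying for them.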

Machine Learning

Machine learning uses pattern recognition to “learn” from data, building analytical models and making decisions with minimal human intervention, which can mean faster analysis and trend recognition.

  • Systems Biology - The HPC cluster supports leading products like TensorFlow, an end-to-end open-source platform for machine learning (a minimal example follows this list).
  • AWS - Services like SageMaker let users create, train, and deploy ML models, and AWS provides access to Deep Learning AMIs for building custom environments and workflows.
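
As a small illustration of the TensorFlow workflow both options support, the sketch below defines, trains, and evaluates a tiny Keras classifier on synthetic data. It is a toy example, not a recommended model.

    import numpy as np
    import tensorflow as tf

    # Synthetic binary-classification data standing in for real features.
    rng = np.random.default_rng(seed=0)
    x = rng.normal(size=(1000, 20)).astype("float32")
    y = (x[:, 0] + x[:, 1] > 0).astype("float32")

    # A minimal feed-forward network.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])

    # Train briefly, holding out 20% of the data for validation.
    model.fit(x, y, epochs=5, validation_split=0.2, verbose=0)
    loss, accuracy = model.evaluate(x, y, verbose=0)
    print(f"accuracy: {accuracy:.2f}")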