The power of cloud HPC for dynamically handling demands in customer-facing analysis
Executive Summary
Macrogen processes massive volumes of genomic data every day. For B2B and research customers, we can use on-premises infrastructure to meet turnaround-time (TAT) commitments. For direct-to-consumer (DTC) services, however, analysis demand is difficult to estimate, since marketing activities can dramatically change the number of incoming samples at any given time. To allocate resources efficiently and to guide the company’s first cloud-based digital-transformation initiative, I architected a cloud-based high-performance computing (HPC) solution that meets strict TATs for customers.
Key Contributions
| Area | Contribution | Description |
| --- | --- | --- |
| Cloud-Scalable Pipeline Investigation | 80% | Led the evaluation and selection of cloud-native pipeline solutions. |
| Infrastructure Architecture & Design | 100% | Designed and implemented the AWS ParallelCluster environment with the Slurm scheduler. |
| Workflow Optimization | 100% | Identified and configured the most cost-effective, high-performance compute instances; optimized job scheduling and resource utilization. |
| Bioinformatics Pipeline Optimization | 100% | Optimized the core bioinformatics pipeline (algorithms, scientific workflow, and tool parameters) for improved accuracy and efficiency. |
| Monitoring & Troubleshooting | 100% | Performed manual monitoring and troubleshooting of the HPC environment and proactively resolved issues as they arose. Grafana integration is planned for future implementation. |
Achievements
- Stabilized turnaround time during peak genomic workloads
- Eliminated capacity bottlenecks caused by fixed on-prem infrastructure
- Built a future-ready HPC platform aligned with data growth
Introduction, Problem, and Goal
Introduction
Large-scale genomic analysis tasks such as ancestry inference and imputation-based analysis require high computational power and must be completed within strict turnaround-time (TAT) constraints. However, the existing on-premises server infrastructure frequently encountered processing delays due to limited compute resources and static scheduling, causing workflow bottlenecks and extended turnaround times.
To address these challenges, I designed and implemented a Slurm-based Cloud HPC environment using AWS ParallelCluster, enabling dynamic auto-scaling, flexible resource expansion, and stable performance for large workloads. I architected the entire system, automated instance provisioning, and optimized batch processing efficiency through workload-aware configuration.
The new cloud HPC system significantly improved compute scalability and reduced operational costs by scaling based on demand. It enabled guaranteed TAT delivery, supported large nationwide analysis projects, and was internally recognized as a best practice in infrastructure innovation.
Problem
- On-premises computing hardware lacked the flexibility to scale for large analysis batches.
- Scheduling inefficiencies caused processing delays and failure to meet TAT requirements.
- Infrastructure expansion required high cost and long lead times, limiting operational agility.
Goal
- Build a flexible HPC computing environment capable of autoscaling based on workload.
- Replace manual job scheduling with an automated Slurm-based cloud system.
- Reduce cost and guarantee TAT through elastic resource management.
Technical Overview
- HPC Framework
  - AWS ParallelCluster
  - AWS Auto Scaling Groups (ASG) for elastic compute capacity
  - Slurm job scheduler, chosen for its fine-grained control of jobs and its similarity to SGE, which most on-premises bioinformatics environments, including ours, rely on
- Infrastructure Automation
  - AWS CloudFormation
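As an illustrative sketch, a ParallelCluster (v3) configuration along these lines wires the components above together. Region, instance types, subnet IDs, key name, queue names, and node counts below are placeholders, not our production values:

```yaml
# Hypothetical ParallelCluster v3 configuration sketch; all identifiers
# and sizes are placeholders for illustration only.
Region: ap-northeast-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: m5.xlarge          # placeholder; sized via bottleneck analysis
  Networking:
    SubnetId: subnet-xxxxxxxx      # placeholder
  Ssh:
    KeyName: my-key                # placeholder
Scheduling:
  Scheduler: slurm                 # Slurm, not AWS Batch, for fine-grained control
  SlurmQueues:
    - Name: genomics
      ComputeResources:
        - Name: c5-nodes
          InstanceType: c5.4xlarge # placeholder compute instance
          MinCount: 0              # scale to zero when the queue is empty
          MaxCount: 64             # upper bound for burst capacity
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx        # placeholder
```

With `MinCount: 0`, compute nodes exist only while jobs need them, which is what makes demand-driven cost control possible.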
Problem-Solving in Action
1. Eliminating On-Prem Resource Bottlenecks
Problem:
Large-scale genomic analysis jobs frequently stalled due to limited compute resources and fixed hardware capacity in the on-premises HPC environment. During peak processing periods, job queues grew rapidly, causing prolonged wait times and underutilization of parallel execution opportunities. Because infrastructure capacity could not be expanded dynamically, the system was unable to respond effectively to sudden spikes in workload demand.
Solution:
A cloud-based HPC cluster was designed using AWS ParallelCluster, enabling compute resources to scale automatically based on job demand. By integrating Slurm with cloud-native autoscaling, compute nodes were provisioned on demand when jobs entered the queue and terminated once processing completed. This approach ensured stable throughput, eliminated queue congestion, and allowed large workloads to be processed without manual intervention or infrastructure planning.
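In ParallelCluster, Slurm's power-saving hooks handle the actual node resume/suspend calls; the scale-out decision they implement can be approximated with a toy model like the following. This is a simplified sketch for intuition, not the real plugin logic, and all sizing parameters are hypothetical:

```python
import math

def nodes_to_provision(pending_jobs: int, running_nodes: int,
                       cpus_per_job: int = 4, cpus_per_node: int = 16,
                       max_nodes: int = 64) -> int:
    """Toy model of demand-driven scale-out: how many additional nodes
    to launch so that every queued job can start, capped at max_nodes.
    Parameter defaults are illustrative, not production values."""
    jobs_per_node = max(1, cpus_per_node // cpus_per_job)
    needed = math.ceil(pending_jobs / jobs_per_node)   # nodes to cover the queue
    target = min(max_nodes, needed)                    # respect the cluster cap
    return max(0, target - running_nodes)              # only the delta to launch
```

For example, with 40 pending 4-CPU jobs on 16-CPU nodes and 2 nodes already running, the model asks for 8 more nodes; once the queue drains, the target falls back toward zero and idle nodes are terminated.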
2. Ensuring TAT Completion During Peak Workloads
Problem:
High-volume genomic projects required strict turnaround time (TAT) guarantees, with minimal tolerance for processing delays. Under the previous environment, peak workloads often resulted in scheduling contention, making it difficult to predict completion times and meet customer commitments.
Solution:
Job scheduling workflows were optimized using Slurm by tuning queue configurations, job priorities, and resource allocation strategies. Multiple scheduling scenarios were tested to identify configurations that maximized parallel execution while maintaining fair resource distribution. As a result, job execution became more predictable, wait times were reduced, and TAT targets were consistently met, even during peak processing periods.
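The kind of slurm.conf settings tuned during this work looks like the following fragment. The parameter names are standard Slurm scheduling options; the weights shown are illustrative examples, not our exact production values:

```ini
# Illustrative slurm.conf scheduling fragment; weights are examples only.
SchedulerType=sched/backfill          # backfill small jobs into idle gaps
PriorityType=priority/multifactor     # blend age, fair-share, size, etc.
PriorityWeightAge=1000                # prevent starvation of long-waiting jobs
PriorityWeightFairshare=10000         # balance usage across projects
PriorityWeightJobSize=500             # modest boost for large parallel jobs
PriorityMaxAge=7-0                    # age factor saturates after 7 days
```

Backfill scheduling plus multifactor priority is what makes completion times predictable: large jobs keep their place in line while small jobs fill otherwise-idle slots.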
3. Choosing the Optimal Head Node Instance
Problem:
Unlike compute nodes, the head node in an HPC cluster plays a critical role in job scheduling, resource orchestration, and cluster stability. An underpowered head node can become a performance bottleneck, leading to delayed job submissions, increased scheduling latency, slower management operations, and even job failures due to network communication issues between the head node and compute nodes. Initial configurations revealed that default instance selections were not sufficient to handle large-scale workloads efficiently.
Solution:
The architecture was designed to be loosely coupled in order to minimize stress on the head node. I analyzed CPU load, network traffic, and other resource usage on the head node under representative workloads to identify the primary bottleneck, then selected an instance type that addressed that bottleneck resource, ensuring that communication and job management did not become limiting factors. This analysis-driven approach made it possible to identify and deploy a right-sized head node instance.
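One concrete way to keep the design loosely coupled is to move shared storage off the head node: a plain EBS-backed shared directory is NFS-exported from the head node to every compute node, so heavy I/O from the fleet lands on the head node itself. A hypothetical ParallelCluster fragment using an external FSx for Lustre file system instead (capacity and names are placeholders):

```yaml
# Hypothetical SharedStorage fragment: an externally managed FSx for
# Lustre file system rather than a head-node-exported EBS volume, so
# compute-node I/O does not stress the head node. Values are placeholders.
SharedStorage:
  - MountDir: /shared
    Name: scratch-fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 2400   # GiB, placeholder
```

With storage decoupled this way, head node sizing can focus on scheduler throughput and network communication rather than file serving.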
4. Managing Slurm Job Scheduler
Problem:
When using AWS Batch as the job scheduler, users have limited control over individual job scheduling and resource allocation, as AWS manages these aspects automatically. While this simplifies management, it restricts the ability to fine-tune job execution and resource usage. In contrast, adopting Slurm as the scheduler provides granular control over job priorities, resource allocation, and scheduling policies. However, effective use of Slurm requires a deep understanding of its architecture, including the roles and interactions of various daemons (such as slurmctld and slurmd), as well as Slurm-specific log tracking, configuration, and troubleshooting. This level of detail is primarily necessary for the architect or system administrator responsible for designing and maintaining the cluster, rather than for end users or general operations staff.
Solution:
To overcome these challenges, I proactively invested time in learning the intricacies of Slurm, starting from its basic architecture to advanced configuration and troubleshooting. I studied official documentation, explored community forums, and experimented with test environments to understand the roles and interactions of key daemons (such as slurmctld and slurmd), as well as best practices for log tracking and resource management. Whenever I encountered issues, I systematically analyzed logs, consulted relevant resources, and iteratively refined the cluster configuration until optimal performance and stability were achieved.
Through this hands-on approach, I developed the expertise needed to manage the entire Slurm environment independently. This allowed me to shield the research team from operational complexity, enabling them to focus solely on job submission while I handled all aspects of system administration, monitoring, and troubleshooting. My continuous learning and problem-solving ensured that the team could reliably leverage Slurm’s advanced scheduling capabilities without being burdened by the underlying technical details.
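As a small example of the log-tracking side of this work, a helper along these lines can pull error and fatal lines out of a slurmctld log for quick triage. The bracketed-timestamp line format shown is the common slurmctld layout, but treat it as an assumption about your particular Slurm build; the helper itself is an illustrative sketch:

```python
import re

# Matches slurmctld-style lines such as:
#   [2024-03-01T10:00:00.123] error: Nodes q1-dy-c5-1 not responding
# The exact format is an assumption about a typical Slurm build.
LOG_LINE = re.compile(r"\[(?P<ts>[^\]]+)\]\s+(?P<level>error|fatal):\s+(?P<msg>.*)")

def scan_slurm_log(lines):
    """Return (timestamp, level, message) tuples for error/fatal lines,
    skipping info/debug noise."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m:
            hits.append((m.group("ts"), m.group("level"), m.group("msg")))
    return hits
```

Feeding it the tail of `slurmctld.log` surfaces unresponsive-node and daemon-startup failures immediately, which is most of what day-to-day Slurm troubleshooting comes down to.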
