Macrogen: Cloud HPC using AWS ParallelCluster

Tag
Cloud
HPC
Tech Stack
AWS ParallelCluster, AWS CloudFormation, Slurm

The power of cloud HPC for dynamically handling demands in customer-facing analysis

Executive Summary

 
 

Introduction, Problem, and Solution

Problem
When running ancestry analysis and imputation-based analysis in the Service Development and Analysis Department, computing resources were insufficient, which put these and other analyses at risk of missing the required turnaround time (TAT). In addition, the primary consumers of these analyses are individual customers, not researchers or institutes.
This makes it difficult to allocate the right amount of computing resources:
  • Assigning too many resources can lead to waste if demand decreases.
  • Assigning too few resources can cause TAT issues if demand increases, such as during promotions or periods of high activity.
Therefore, a dedicated and scalable solution for computing resources is required to meet TAT requirements efficiently.
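The trade-off in the two bullets above can be made concrete with a little arithmetic. The following is a minimal sketch, not a cost model we actually used: the function names, hourly rates, and node-hour figures are all hypothetical and chosen only to show how a fixed allocation compares with pay-per-use compute plus an always-on head node.

```python
def static_cost(nodes: int, node_rate: float, hours: float) -> float:
    """Cost of keeping a fixed number of compute nodes running all period."""
    return nodes * node_rate * hours


def elastic_cost(node_hours_used: float, node_rate: float,
                 head_node_rate: float, hours: float) -> float:
    """Cost of an always-on head node plus compute billed only while in use."""
    return head_node_rate * hours + node_hours_used * node_rate


HOURS_PER_MONTH = 720      # 30-day month
NODE_RATE = 2.0            # hypothetical price per compute node-hour
HEAD_NODE_RATE = 0.25      # hypothetical price per head node-hour

# Hypothetical demand: a quiet month and a promotion month.
fixed = static_cost(4, NODE_RATE, HOURS_PER_MONTH)
quiet = elastic_cost(200, NODE_RATE, HEAD_NODE_RATE, HOURS_PER_MONTH)
promo = elastic_cost(2000, NODE_RATE, HEAD_NODE_RATE, HOURS_PER_MONTH)

print(f"fixed 4 nodes: {fixed:.0f}")
print(f"elastic, quiet month: {quiet:.0f}")
print(f"elastic, promotion month: {promo:.0f}")
```

With these made-up numbers the fixed allocation costs the same in both months, while the elastic cost follows actual usage. Real figures depend on instance types, storage, and data transfer, none of which are modeled here.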
Solution
To address this, we considered several cloud-based HPC solutions on AWS, including AWS Batch, AWS ParallelCluster, and HealthOmics.
  • AWS Batch: Excluded because editing existing analysis pipelines or introducing new ones is inflexible unless everything is packaged as Docker containers. This limitation makes it less suitable for rapidly deploying new products or running a variety of tests.
  • HealthOmics: Excluded due to limited regional availability, which could lead to regulatory issues if requests are made from unsupported regions.
Method
  • Use AWS ParallelCluster with Slurm as the job scheduler.
  • Design the system so that users log in through our on-premise server and submit jobs as they would with our on-premise SGE HPC.
  • Enable scalability to handle increased demand while incurring only head node costs when there is no demand, which makes it a good fit for the direct-to-consumer (DTC) model.
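The method above can be sketched as a cluster definition. The snippet below is an illustrative assumption, not our production configuration: it builds a dictionary following the AWS ParallelCluster 3 configuration layout (Slurm scheduler, one queue, `MinCount: 0` so that no compute nodes run when idle) and prints it as JSON. The region, subnet IDs, key name, instance types, and queue name are placeholders, and the helper `build_cluster_config` is hypothetical.

```python
import json


def build_cluster_config(region: str, head_subnet: str, compute_subnet: str,
                         key_name: str, max_nodes: int) -> dict:
    """Return a ParallelCluster-3-style config with a scale-to-zero Slurm queue."""
    return {
        "Region": region,
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "t3.medium",          # placeholder instance type
            "Networking": {"SubnetId": head_subnet},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [{
                "Name": "analysis",               # placeholder queue name
                "ComputeResources": [{
                    "Name": "ondemand",
                    "InstanceType": "c5.4xlarge",  # placeholder instance type
                    "MinCount": 0,                 # no compute nodes when idle
                    "MaxCount": max_nodes,         # upper bound during peaks
                }],
                "Networking": {"SubnetIds": [compute_subnet]},
            }],
        },
    }


# All identifiers below are placeholders, not real resources.
config = build_cluster_config(
    region="ap-northeast-2",
    head_subnet="subnet-HEAD-PLACEHOLDER",
    compute_subnet="subnet-COMPUTE-PLACEHOLDER",
    key_name="my-key-placeholder",
    max_nodes=20,
)
print(json.dumps(config, indent=2))
```

ParallelCluster expects a YAML file, so the printed JSON would still need to be converted to YAML or checked against the official schema before being passed to `pcluster create-cluster`.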
 

Technical Overview

 
AWS S
 
 

Challenges and Solutions

The initial idea of
 
 
  1. Framework
  2. Domain restriction
  3. Lost in the middle

🖥 Example of AX from the Infrastructure Perspective — Cloud HPC Using AWS ParallelCluster

Executive Summary

Large-scale genomic analysis tasks such as ancestry inference and imputation-based analysis require high computational power and must be completed within strict TAT (Turnaround Time) constraints. However, the existing on-premise server infrastructure frequently encountered processing delays due to limited compute resources and static scheduling, causing workflow bottlenecks and extended turnaround times.
To address these challenges, I designed and implemented a Slurm-based Cloud HPC environment using AWS ParallelCluster, enabling dynamic auto-scaling, flexible resource expansion, and stable performance for large workloads. I architected the entire system, automated instance provisioning, and optimized batch processing efficiency through workload-aware configuration.
The new cloud HPC system significantly improved compute scalability and reduced operational costs by scaling based on demand. It enabled guaranteed TAT delivery, supported large nationwide analysis projects, and was internally recognized as a best practice in infrastructure innovation.

Key Contributions

  • A. Infrastructure Architecture & Design: Built a high-performance compute environment using AWS ParallelCluster with the Slurm scheduler.
  • B. Workflow Optimization: Tuned batch processing strategies for large-scale genomic analysis workloads.
  • C. Automation & Cost Efficiency: Automated cluster provisioning, scaling, and resource management to optimize cost-to-performance ratio.
  • D. Monitoring & Performance Visualization: Integrated monitoring dashboards using Grafana for real-time performance and resource tracking.
Tech Stack:
AWS ParallelCluster, AWS CloudFormation, Slurm, Grafana

Introduction, Problem, and Goal

Introduction

High-performance computing (HPC) is essential to scale genomic research and clinical analysis. Traditional on-premise cluster environments struggle to support peak computational needs, leading to delays and workflow disruption as project volumes grow.

Problem

  • On-premise computing hardware lacked flexibility to scale for large analysis batches.
  • Scheduling inefficiencies caused processing delays and failure to meet TAT requirements.
  • Infrastructure expansion required high cost and long lead times, limiting operational agility.

Goal

  • Build a flexible HPC computing environment capable of autoscaling based on workload.
  • Replace manual job scheduling with an automated Slurm-based cloud system.
  • Reduce cost and guarantee TAT through elastic resource management.

Technical Overview

  • HPC Framework
    • AWS ParallelCluster with Slurm scheduler
  • Infrastructure Automation
    • AWS CloudFormation
  • Monitoring & Dashboard
    • Grafana
  • Workload Optimization
    • Performance tuning for batch genomic pipelines

Problem-Solving in Action

1. Eliminating On-Prem Resource Bottlenecks

Problem:
Large analysis jobs frequently stalled due to limited compute resources and static hardware capacity.
Solution:
Designed a cloud HPC cluster that automatically scales resources based on job load, ensuring stable throughput and eliminating processing delays.
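To illustrate what "scales resources based on job load" means in practice, here is a simplified sketch of the sizing decision. It is not the algorithm ParallelCluster or Slurm actually uses (there, Slurm's power-saving mechanism resumes the specific nodes the scheduler assigns to pending jobs); the function `nodes_to_launch` and its parameters are hypothetical and only convey the idea of sizing the fleet to the queue, capped by a maximum.

```python
import math


def nodes_to_launch(pending_cpus: int, cpus_per_node: int,
                    running_nodes: int, max_nodes: int) -> int:
    """Estimate how many extra nodes to start for the CPUs waiting in the queue."""
    if cpus_per_node < 1:
        raise ValueError("cpus_per_node must be at least 1")
    # Nodes needed to cover all pending CPU requests, rounded up.
    wanted = math.ceil(pending_cpus / cpus_per_node)
    # Never exceed the configured cluster ceiling.
    headroom = max(max_nodes - running_nodes, 0)
    return min(wanted, headroom)


# Hypothetical situations: idle queue, moderate backlog, backlog above the cap.
print(nodes_to_launch(pending_cpus=0, cpus_per_node=16, running_nodes=0, max_nodes=10))
print(nodes_to_launch(pending_cpus=100, cpus_per_node=16, running_nodes=0, max_nodes=10))
print(nodes_to_launch(pending_cpus=1000, cpus_per_node=16, running_nodes=4, max_nodes=10))
```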

2. Ensuring TAT Completion During Peak Workloads

Problem:
High-volume projects required strict job completion timing with no tolerance for delays.
Solution:
Optimized job scheduling workflows using Slurm and tested multiple configurations to maximize throughput.
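One common way to tune throughput for per-sample workloads in Slurm is a job array with a concurrency limit. The sketch below only assembles the `sbatch` command line and does not submit anything. The script name `impute_one_sample.sh`, the partition name, and the sample counts are hypothetical, and this is not necessarily the configuration that was adopted here.

```python
import shlex


def sbatch_array_command(script: str, n_samples: int, max_concurrent: int,
                         cpus_per_task: int, partition: str = "analysis") -> list:
    """Build an sbatch command that runs one array task per sample."""
    if n_samples < 1 or max_concurrent < 1 or cpus_per_task < 1:
        raise ValueError("counts must be positive")
    return [
        "sbatch",
        "--parsable",                                  # print only the job ID
        f"--partition={partition}",
        f"--array=1-{n_samples}%{max_concurrent}",     # % caps simultaneous tasks
        f"--cpus-per-task={cpus_per_task}",
        script,
    ]


# Hypothetical batch: 500 samples, at most 50 running at once, 8 CPUs each.
cmd = sbatch_array_command("impute_one_sample.sh", 500, 50, 8)
print(shlex.join(cmd))
```

Inside the script, each task can read `SLURM_ARRAY_TASK_ID` to pick its sample. Varying `max_concurrent` and `cpus_per_task` is one simple way to compare configurations.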

3. Reducing Operational Cost

Problem:
Maintaining always-on hardware was expensive and inefficient.
Solution:
Implemented autoscaling to allocate compute only when needed, achieving cost savings.

Achievements

  • Designed full HPC cloud architecture and automated AWS instance provisioning.
  • Improved performance and reduced TAT through compute autoscaling.
  • Achieved cost optimization by scaling compute based on demand.
  • Selected internally as a best practice case for infrastructure innovation and published within the organization.

Project Title

Cloud HPC Using AWS ParallelCluster
Roles: Full AWS architecture design, Cloud HPC operation & optimization