Problem
Due to differences in server and user environments, headquarters and branch offices are unable to use the same pipeline for identical analyses. This leads to several challenges:
- Separate pipelines must be developed and maintained for each location, resulting in:
  - Decentralized pipeline management
  - Increased human resource requirements
  - Duplicated development and labor costs across locations
- Headquarters lacks a centralized method to manage, deploy, and maintain authority over analysis pipelines.
- In high-labor-cost regions (e.g., the U.S. and Europe), maintaining local staff is expensive; centralized management would reduce these costs.
Solution
We considered several HPC solutions on AWS, including AWS Batch, AWS ParallelCluster, and HealthOmics.
- AWS ParallelCluster: Excluded because it is less flexible in scalability and cost efficiency than AWS Batch. Scalability is a key requirement for this project: the headquarters pipeline should run at Psomagen, the European branches, and other locations with only simple adjustments to the AWS environment settings. Building an HPC environment with ParallelCluster in each region scales poorly and demands more architecture work and resources, and keeping head nodes running adds unnecessary cost, making it less cost-effective than AWS Batch.
- HealthOmics: Excluded due to limited regional availability; requests originating from unsupported regions could raise regulatory issues.
Method
- Adopt AWS Batch as the core execution engine.
- Design the system so that when a user uploads or updates an input.tar file in S3, the analysis is triggered automatically (see the trigger sketch after this list) and the final results are stored back in S3.
- Enable centralized pipeline management, scalability across regions, and efficient resource utilization.
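A minimal sketch of the trigger path, assuming a Lambda function subscribed to the bucket's ObjectCreated events; the queue and job definition names are placeholders supplied per region, not the production values:

```python
import os
import re
import boto3

batch = boto3.client("batch")

def handler(event, context):
    """S3 ObjectCreated event -> one AWS Batch job per uploaded input.tar."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        if not key.endswith("input.tar"):
            continue  # ignore unrelated uploads
        # Batch job names allow only letters, digits, hyphens, and underscores.
        job_name = re.sub(r"[^A-Za-z0-9_-]", "-", key)[:128]
        batch.submit_job(
            jobName=job_name,
            jobQueue=os.environ["JOB_QUEUE"],            # placeholder, set per region
            jobDefinition=os.environ["JOB_DEFINITION"],  # containerized pipeline image
            containerOverrides={
                "environment": [
                    {"name": "INPUT_S3_URI", "value": f"s3://{bucket}/{key}"},
                    {"name": "OUTPUT_S3_PREFIX", "value": f"s3://{bucket}/results/"},
                ]
            },
        )
```

Because the queue and job definition are injected through environment variables, each branch office can point the same handler at its own regional resources without code changes.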
Technical Overview
The system centers on AWS Batch: an S3 upload event invokes a Lambda function that submits a containerized analysis job, and the results are written back to S3. Headquarters maintains a single pipeline, and each region runs it with only environment-level configuration changes.
🌍 Example of AX from the Global Operations Perspective — Global BI Platform Using AWS Batch
Executive Summary
Global bioinformatics services require tracking sequencing samples across regions, handling diverse processing pipelines, and reliably delivering results to partner organizations. As the scale of metagenome shotgun sequencing expanded, the existing infrastructure struggled to process increasing data volumes, leading to delays and difficulties in automation and scalability.
To solve these challenges, I designed and built a Global BI Platform using AWS Batch, enabling automated and scalable execution of containerized bioinformatics pipelines. The system utilizes AWS Batch and Lambda for dynamic job submission triggered by S3 file uploads, and Docker-based microservices for modular pipeline execution. Infrastructure resources were defined through Terraform for reproducible and maintainable deployments.
The platform dramatically improved data processing efficiency and reliability, enabling automated job execution and single-command deployment. It is now scheduled to launch as a commercial service supporting global customers.
Key Contributions
- A. Infrastructure as Code Design: Defined full AWS resource architecture with Terraform for reproducibility and scalability.
- B. Pipeline Automation: Implemented event-driven job automation using AWS Batch + AWS Lambda.
- C. Container-Oriented Microservices: Built Docker-based analysis pipeline enabling modular execution and fast deployment.
- D. Performance Optimization: Monitored resource utilization and optimized batch job behavior using pipeline metrics.
Tech Stack:
AWS Batch, AWS Lambda, Docker, Terraform
Introduction, Problem, and Goal
Introduction
Global sequencing services require fast and automated processing pipelines capable of scaling to high-throughput workloads. As data volume grew rapidly, the existing legacy pipeline struggled to handle workload surges and required manual intervention.
Problem
- The legacy pipeline could not scale to meet rising global sequencing volume.
- Manual job execution and resource bottlenecks created delays.
- Lack of automation made global bioinformatics operations inefficient and error-prone.
Goal
- Build a scalable global BI processing platform that automatically schedules and runs tasks.
- Support event-triggered workflows and dynamic resource allocation.
- Improve pipeline processing speed and deployment scalability.
Technical Overview
- Infrastructure as Code
  - Terraform for automated provisioning of AWS resources
- Job Scheduling & Automation
  - AWS Batch + AWS Lambda for event-driven workflow orchestration
- Containerized Pipeline
  - Docker-based microservices for modular and reproducible workload execution
- Monitoring & Optimization
  - Resource-based optimization and automated failure recovery scripts (a recovery sketch follows this list)
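As a sketch of what such a recovery script could look like (the queue name and the one-retry policy are assumptions for illustration, not the production script), a small boto3 loop can list failed Batch jobs and resubmit each one once:

```python
import boto3

batch = boto3.client("batch")
JOB_QUEUE = "global-bi-queue"  # hypothetical queue name

def resubmit_failed_jobs():
    """Resubmit each FAILED job on the queue once, tagging retries by name."""
    failed = batch.list_jobs(jobQueue=JOB_QUEUE, jobStatus="FAILED")["jobSummaryList"]
    for summary in failed:
        detail = batch.describe_jobs(jobs=[summary["jobId"]])["jobs"][0]
        if detail["jobName"].endswith("-retry"):
            continue  # already retried once; leave for manual inspection
        batch.submit_job(
            jobName=(detail["jobName"] + "-retry")[:128],
            jobQueue=JOB_QUEUE,
            jobDefinition=detail["jobDefinition"],
            containerOverrides={
                # Carry over the original job's environment (e.g., input S3 URI).
                "environment": detail["container"].get("environment", []),
            },
        )
```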
Problem-Solving in Action
1. Automating Analysis Pipelines
Problem:
Manual job submissions slowed processing and increased error rates.
Solution:
Implemented automated job submission triggered by S3 uploads using AWS Batch + Lambda.
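The wiring itself comes down to a single bucket-notification call. A sketch, assuming the Lambda function has already granted s3.amazonaws.com invoke permission; the bucket name and ARN are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Placeholders for illustration only.
BUCKET = "global-bi-input"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:submit-batch-job"

# Invoke the submitter Lambda whenever an input.tar object lands in the bucket.
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": LAMBDA_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [{"Name": "suffix", "Value": "input.tar"}]
                    }
                },
            }
        ]
    },
)
```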
2. Scaling Infrastructure for Large Datasets
Problem:
Increasing sequencing volume overwhelmed existing processing systems.
Solution:
Designed microservice-based architecture using Docker and on-demand scaling with AWS Batch.
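On-demand scaling in AWS Batch reduces to a managed compute environment that scales between zero and a vCPU ceiling. A sketch with placeholder subnet, security group, and role values; the names and the 256-vCPU ceiling are assumptions for illustration:

```python
import boto3

batch = boto3.client("batch")

# All names, subnets, and roles below are placeholders for illustration.
batch.create_compute_environment(
    computeEnvironmentName="global-bi-compute",
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "EC2",
        "allocationStrategy": "BEST_FIT_PROGRESSIVE",
        "minvCpus": 0,        # scale to zero when no jobs are queued
        "maxvCpus": 256,      # ceiling for workload surges
        "desiredvCpus": 0,
        "instanceTypes": ["optimal"],
        "subnets": ["subnet-0123456789abcdef0"],
        "securityGroupIds": ["sg-0123456789abcdef0"],
        "instanceRole": "ecsInstanceRole",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```

A job queue attached to this environment then holds submitted jobs until capacity spins up, so idle periods cost nothing beyond storage.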
3. Deployment & Maintainability
Problem:
Complex pipeline updates required extensive manual reconfiguration.
Solution:
Used Terraform for consistent deployment, enabling single-command provisioning across environments.
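The Terraform modules themselves are not reproduced here; as an illustration of the single-command experience, a thin wrapper (assuming a terraform binary >= 1.4 on PATH, one workspace per region, and a module that declares a region variable) could look like:

```python
import subprocess
import sys

def deploy(region_workspace: str) -> None:
    """Provision one region's stack: python deploy.py <workspace>."""
    subprocess.run(["terraform", "init", "-input=false"], check=True)
    # -or-create requires Terraform >= 1.4.
    subprocess.run(
        ["terraform", "workspace", "select", "-or-create", region_workspace],
        check=True,
    )
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var=region={region_workspace}"],
        check=True,
    )

if __name__ == "__main__":
    deploy(sys.argv[1])
```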
Achievements
- Enabled automated job execution, significantly improving pipeline efficiency.
- Built containerized microservice architecture deployable via a single command.
- Generated new commercial service opportunity through scalable global pipeline design.
Project Title
Global BI Platform Using AWS Batch
Roles: AWS system architecture, infrastructure automation, pipeline development
