Macrogen: A Global BI Platform Using AWS Batch for a Centralized, Scalable Automation Pipeline

Tags: Cloud, Macrogen, ENG
Tech Stack: AWS Batch, AWS Lambda, ECS (Docker), Terraform

Powering a Global BI Platform with Scalable, Simplified Deployment Using AWS Batch and Amazon ECS (Docker)

Executive Summary

Macrogen is a global company whose branches each conduct their own sequencing and data analysis, because GDPR and other region-specific regulations govern how data may be handled, stored, and analyzed. This led to a fragmented analysis system across branches. To solve this problem and automate the workflow for better productivity, I led the architecture of a global BI (bioinformatics) platform built on AWS Batch.

Key Contributions

| Area | Contribution | Description |
| --- | --- | --- |
| Investigate Cloud Scalable Pipeline | 80% | Led the evaluation and selection of cloud-native pipeline solutions. |
| Infrastructure as Code Design | 100% | Defined full AWS resource architecture with Terraform for reproducibility and scalability. |
| Architect AWS Infra for Pipeline Automation | 100% | Implemented event-driven job automation using AWS Batch and AWS Lambda. |
| Container-Oriented Microservices | 100% | Built Docker-based analysis pipeline enabling modular execution and fast deployment. |
| Performance Optimization (Infrastructure/Batch) | 100% | Monitored resource utilization and optimized batch job behavior using pipeline metrics. |
| Bioinformatics Pipeline Optimization | 70% | Optimized the core bioinformatics pipeline. |

Achievements

  • Enabled automated job execution, significantly improving pipeline efficiency.
  • Built containerized microservice architecture deployable via a single command.
  • Generated new commercial service opportunity through scalable global pipeline design.

Introduction, Problem, and Goal

Introduction

Global bioinformatics services require tracking sequencing samples across regions, handling diverse processing pipelines, and reliably delivering results to partner organizations. As the scale of metagenome shotgun sequencing expanded, the existing infrastructure struggled to process increasing data volumes, leading to delays and difficulties in automation and scalability.
To solve these challenges, I designed and built a Global BI Platform using AWS Batch, enabling automated and scalable execution of containerized bioinformatics pipelines. The system utilizes AWS Batch and Lambda for dynamic job submission triggered by S3 file uploads, and Docker-based microservices for modular pipeline execution. Infrastructure resources were defined through Terraform for reproducible and maintainable deployments.
The platform dramatically improved data processing efficiency and reliability, enabling automated job execution and single-command deployment. It is now scheduled to launch as a commercial service supporting global customers.

Problem

  • The legacy pipeline could not scale to meet the rising global sequencing volume.
  • Manual job execution and resource bottlenecks created delays.
  • Each branch of the company maintained its own pipelines, increasing development and management costs.
  • Lack of automation made global bioinformatics operations inefficient and error-prone.

Goal

  • Build a scalable global BI processing platform that automatically schedules and runs tasks.
  • Support event-triggered workflows and dynamic resource allocation.
  • Improve pipeline processing speed and deployment scalability.

Technical Overview

  • Job Scheduling & Automation
    • AWS Batch + AWS Lambda for event-driven workflow orchestration
  • Containerized Pipeline
    • Docker-based microservices for modular and reproducible workload execution
  • Monitoring & Optimization
    • Resource-based optimization and automated failure recovery scripts
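The failure-recovery step above can be sketched as a small pass over AWS Batch job states that decides which failed jobs to resubmit. The retry limit and the attempt-count naming convention are assumptions for illustration; the response fields (`jobs`, `status`, `jobName`) follow the shape of the `boto3` Batch `describe_jobs` response.

```python
"""Sketch of an automated failure-recovery pass over AWS Batch jobs.

Assumptions (not from the write-up): the retry limit and the idea of
encoding the attempt count in the job name are illustrative only.
"""

MAX_RETRIES = 2  # assumed limit


def jobs_to_resubmit(describe_jobs_response: dict) -> list:
    """Return names of FAILED jobs that still have retries left."""
    retry = []
    for job in describe_jobs_response.get("jobs", []):
        if job.get("status") != "FAILED":
            continue
        # Assumed convention: a "--attemptN" suffix tracks prior retries.
        _, _, attempt = job["jobName"].rpartition("--attempt")
        attempts = int(attempt) if attempt.isdigit() else 0
        if attempts < MAX_RETRIES:
            retry.append(job["jobName"])
    return retry
```

A scheduled Lambda (or cron script) could feed this the output of `describe_jobs` and resubmit the returned names with an incremented attempt suffix.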

Problem-Solving in Action

1. Automating Analysis Pipelines

Problem:
The existing bioinformatics pipelines relied heavily on manual job submission. Engineers were required to monitor input data arrival and trigger processing jobs by hand, which slowed execution, introduced human error, and created inconsistent processing timelines across regions. This manual dependency became increasingly unsustainable as sequencing volume and operational scale grew.
Solution:
An event-driven automation model was implemented using AWS Lambda and AWS Batch. When input files were uploaded to Amazon S3, Lambda automatically triggered the appropriate Batch jobs without human intervention. This ensured consistent execution, reduced processing latency, and eliminated manual operational steps while maintaining full traceability of job execution.
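A minimal sketch of that trigger path: a Lambda handler receives the S3 event, derives job parameters from the uploaded object, and submits a Batch job. The queue and job-definition names and the S3 key layout are hypothetical; the event fields follow the standard S3 notification structure, and `submit_job` is the real `boto3` Batch call.

```python
"""Sketch of the S3-triggered Batch submission described above.

Assumptions: the queue/job-definition names and key layout are
illustrative only, not the production values.
"""
import os
import urllib.parse

# Hypothetical names; the real ones are not given in the write-up.
JOB_QUEUE = os.environ.get("BATCH_JOB_QUEUE", "bi-pipeline-queue")
JOB_DEFINITION = os.environ.get("BATCH_JOB_DEFINITION", "bi-pipeline-jobdef")


def build_submit_args(bucket: str, key: str) -> dict:
    """Translate one uploaded S3 object into submit_job keyword args."""
    sample_id = key.rsplit("/", 1)[-1].split(".")[0] or "unknown"
    return {
        "jobName": f"bi-{sample_id}",
        "jobQueue": JOB_QUEUE,
        "jobDefinition": JOB_DEFINITION,
        "containerOverrides": {
            "environment": [
                {"name": "INPUT_BUCKET", "value": bucket},
                {"name": "INPUT_KEY", "value": key},
            ]
        },
    }


def handler(event, context):
    """Lambda entry point: one Batch job per uploaded object."""
    import boto3  # imported lazily so the pure logic stays testable offline

    batch = boto3.client("batch")
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        batch.submit_job(**build_submit_args(bucket, key))
```

Keeping the parameter-building logic pure (no AWS calls) makes the submission behavior unit-testable without credentials, which matters for the traceability mentioned above.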

2. Scaling Infrastructure for Large Datasets

Problem:
As global sequencing volume increased, the legacy processing systems struggled to handle large and variable workloads. Static infrastructure frequently became saturated during peak periods, resulting in delayed processing and inefficient resource utilization.
Solution:
A microservice-based architecture was designed using Docker to encapsulate individual analysis steps. AWS Batch was used to provision compute resources dynamically at the job level, enabling on-demand scaling based on workload size and complexity. This approach allowed the system to efficiently process large datasets, absorb sudden workload spikes, and scale down automatically when demand decreased.
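The job-level sizing can be sketched as a tier lookup that maps input size to Batch `containerOverrides`. The thresholds and vCPU/memory values below are assumptions for illustration (the write-up does not publish the real tiers); the `resourceRequirements` shape matches the Batch `submit_job` API.

```python
"""Sketch of per-job resource sizing for AWS Batch.

Assumption: the size thresholds and vCPU/memory values are invented
for illustration, not Macrogen's production configuration.
"""

# Hypothetical tiers: (max input size in GiB, vCPUs, memory in MiB)
RESOURCE_TIERS = [
    (5, 4, 16384),
    (50, 16, 65536),
    (float("inf"), 64, 262144),
]


def container_overrides_for(input_size_gib: float) -> dict:
    """Pick Batch containerOverrides by input dataset size."""
    for max_gib, vcpus, memory_mib in RESOURCE_TIERS:
        if input_size_gib <= max_gib:
            return {
                "resourceRequirements": [
                    {"type": "VCPU", "value": str(vcpus)},
                    {"type": "MEMORY", "value": str(memory_mib)},
                ]
            }
```

Because each job carries its own sizing, Batch provisions only what that job needs and the compute environment scales back down when the queue drains, which is the scale-down behavior described above.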

3. Deployment & Maintainability

Problem:
Pipeline updates and infrastructure changes required extensive manual reconfiguration across environments. This increased deployment time, introduced inconsistencies between regions, and made maintenance error-prone and costly.
Solution:
The entire platform was defined using Terraform as infrastructure as code. This enabled consistent, repeatable deployments with a single command, ensuring that environments could be provisioned, updated, or replicated reliably across regions. As a result, deployment complexity was significantly reduced, maintenance became more predictable, and global consistency was enforced across all environments.