Macrogen: Genetic Counseling Chatbot (GenTok AI; chatGENE AI)

Tag: Macrogen, LLM, RAG, ENG
Tech Stack: OpenAI, Python, Streamlit, LangChain, ChromaDB, AWS

🧑 Example of AX from the Customer Perspective: Genetic Counseling Chatbot (GenTok AI; chatGENE AI)

Executive Summary

With the rapid growth of direct-to-consumer (DTC) genetic testing, services like GenTok AI and chatGENE AI deliver genetic information and test results directly to customers. This expansion, however, has created new challenges: a dramatic increase in counselor workload, the need for specialized knowledge to interpret complex results, and difficulties in searching for information within lengthy PDF reports. To address these challenges, I designed and deployed an LLM-based genetic counseling chatbot that uses Retrieval-Augmented Generation (RAG) to answer customers' follow-up questions about their test results.

Key Contributions

  • A. Problem Identification and Project Planning: Identified the core challenge of scaling genetic counseling services to meet rising DTC demand and led the overall project planning and strategy.
  • B. LLM Chatbot Development: Designed and developed a domain-specific chatbot using the OpenAI API, ensuring accurate, accessible, and domain-restricted explanations of genetic test results.
  • C. Web Interface Implementation: Built an interactive web application using Streamlit, enabling seamless user interaction and rapid deployment.
  • D. Cloud Deployment and Infrastructure: Deployed the solution on AWS, utilizing AMI, ASG, and ALB for scalability, reliability, and maintainability, while also managing cloud security and monitoring.

Achievements

  • Successfully launched the AI service, enabling customers to ask follow-up questions about their genetic test results and receive clear, domain-accurate answers.
  • Significantly reduced the workload on human counselors through automation.
  • Improved overall customer experience and satisfaction.
  • Established a scalable solution essential for maintaining service quality as the business expanded into the DTC market.

Introduction, Problem, and Goal

Introduction

Advancements in sequencing and array technologies have led to a consistent decrease in the cost of genetic testing each year. As genetic testing becomes more affordable and our understanding of genetics deepens, the concept of utilizing genetic information to achieve personalized medicine and healthcare is rapidly gaining traction. This trend has fueled the growth of direct-to-consumer (DTC) genetic testing, which is now expanding beyond traditional clinical DNA testing. At Macrogen, we provide genetic counseling support for individuals who have undergone genetic testing. However, the increasing popularity of DTC genetic testing has introduced new challenges, particularly in managing the growing demand for counseling and ensuring that customers can understand the often complex information contained in their genetic test results.

Problem

  • The surge in DTC genetic testing has significantly increased the workload for genetic counselors, making it difficult to provide timely and high-quality support to all clients.
  • Genetic test result reports often contain complex information that requires specialized knowledge to interpret, which many customers find difficult to understand without expert guidance.

Goal

  • To automate and enhance the genetic counseling process using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) technologies, thereby reducing the burden on human counselors.
  • To provide clear, accurate, and accessible explanations of genetic test results, empowering customers to understand their genetic information regardless of their background knowledge.
  • To maintain or improve the quality and accuracy of information provided to customers while scaling counseling services to meet increasing demand.

Technical Overview

  • Large Language Models (LLMs)
    • OpenAI GPT models (primary, production, via API)
    • Llama (evaluated at PoC stage)
    • Gemma (evaluated at PoC stage)
  • Frameworks
    • Streamlit
  • Data Processing & Storage
    • LangChain (for document embedding and retrieval, initial phase)
    • ChromaDB (vector database, initial phase)
      • Parquet format (for data storage in ChromaDB)
    • OpenAI native file search (current phase)
  • Cloud Infrastructure (AWS)
    • AWS AMI (Amazon Machine Image, for easy updates)
    • AWS ASG (Auto Scaling Group, for scalability)
    • AWS ALB (Application Load Balancer, for scalability and routing)
    • AWS WAF (Web Application Firewall, for security)
    • AWS CloudWatch (for logging and monitoring)

Problem-Solving in Action: Insights from Overcoming Project Hurdles

1. Framework Selection and Frontend Development

Problem:
My limited experience with frontend development made it difficult to quickly build an interactive web application for LLM-based services. I needed a solution that would allow me to leverage my Python expertise without getting bogged down by frontend complexities.
How I Solved It:
I systematically evaluated several frameworks—Streamlit, Dash, and Gradio—by building small prototypes with each. Through this process, I identified that Streamlit provided the most seamless integration with Python and the fastest path to a functional, interactive interface. This allowed me to focus my efforts on LLM integration and backend logic, rather than spending excessive time on UI development.

2. Ensuring Domain Accuracy and Restriction in LLMs

Problem:
In the biology and healthcare domain, it is critical that LLM outputs are both accurate and restricted to the appropriate context. LLMs can sometimes generate off-topic or imprecise responses, which is unacceptable in this field.
How I Solved It:
I implemented a Retrieval-Augmented Generation (RAG) approach, embedding user test results and internal documentation directly into the model’s context. I also invested significant effort in prompt engineering, iteratively refining prompts to guide the model’s responses. To further safeguard accuracy, I considered adding a secondary “verdict” model (an LLM-as-a-judge) to validate outputs before they reach the user. This multi-layered approach kept the LLM’s responses both accurate and domain-specific.
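In spirit, the RAG pipeline above works like the following minimal sketch: retrieve the report sections most relevant to the question, then embed them in a domain-restricted prompt. Everything here is simplified for illustration: retrieval is naive keyword overlap rather than the embedding-based search used in production (LangChain + ChromaDB, later OpenAI's native file search), and the example report chunks and prompt wording are hypothetical.

```python
# rag_sketch.py — toy Retrieval-Augmented Generation pipeline (illustrative only).
# Production used embedding-based retrieval; here retrieval is plain word overlap.

SYSTEM_PROMPT = (
    "You are a genetic counseling assistant. Answer ONLY using the report "
    "excerpts below. If the question is outside genetics or the excerpts, "
    "say you cannot answer."
)


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank report chunks by word overlap with the question (toy scorer)."""
    q_words = set(question.lower().split())
    scored = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return scored[:k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble the domain-restricted prompt sent to the LLM."""
    context = "\n---\n".join(retrieve(question, chunks))
    return f"{SYSTEM_PROMPT}\n\n[Report excerpts]\n{context}\n\n[Question]\n{question}"


# Hypothetical report chunks for demonstration.
report_chunks = [
    "Variant rs429358 in APOE is associated with lipid metabolism.",
    "Lactase persistence markers indicate normal lactose tolerance.",
    "Caffeine metabolism: CYP1A2 fast-metabolizer genotype detected.",
]

prompt = build_prompt("What does my caffeine metabolism result mean?", report_chunks)
```

Grounding the prompt in retrieved excerpts and instructing the model to refuse out-of-scope questions is what keeps responses both accurate and domain-restricted.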

3. Addressing the “Lost in the Middle” Problem

Problem:
When processing long documents, LLMs sometimes overlook or forget information presented in the middle sections, a phenomenon known as “lost in the middle.” This risked missing critical details in user reports or lengthy inputs.
How I Solved It:
I tackled this by chunking documents into smaller, logically organized sections and carefully structuring the context fed to the model. I also prioritized the placement of key information to ensure it was always within the model’s attention window. Additionally, I kept up with advancements in LLM architectures, adopting newer models as they improved context handling.
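The chunk-and-reorder mitigation can be sketched as follows: split the document into chunks, rank them by relevance, then interleave the ranking so the highest-ranked chunks sit at the beginning and end of the context, where models attend most reliably (similar in spirit to "long-context reorder" utilities). The splitter and ordering below are simplified illustrations, not the production logic.

```python
# reorder_sketch.py — "lost in the middle" mitigation (illustrative).
# The most relevant chunks are placed at the edges of the context window,
# where LLMs attend most reliably; the least relevant land in the middle.

def chunk_text(text: str, max_words: int = 40) -> list[str]:
    """Split a long document into fixed-size word chunks (toy splitter)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def edge_order(ranked_chunks: list[str]) -> list[str]:
    """Reorder chunks ranked best-first so the best sit at both edges.

    e.g. ranked [1, 2, 3, 4, 5] -> context order [1, 3, 5, 4, 2]
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With this ordering, the top-ranked chunk opens the context and the second-ranked chunk closes it, so the model's weakest attention region in the middle only ever holds the least critical material.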

Sources

  • Yonhap News — “Macrogen begins open beta of genetic AI counseling service”
  • THE AI — “[5th Anniversary Special] Macrogen CSO Lee Seung-bin: ‘AI reads 3 billion genes, precision medicine answers’”
  • IT Business Today — “KEAN Health Launches AI Search for Genetic Testing”
  • PR TIMES — “A first in Japan: AI search built into genetic testing — chatGENE Pro, the all-in-one genetic test”