Larry Du

About me

I am a Senior AI/ML Engineer and Machine Learning Platform specialist with a Ph.D. in Computational Genomics. My work focuses on building large-scale machine learning systems that bridge research and production, with applications spanning healthcare, genomics, and biological data science.

Currently, I work at GSK on end-to-end ML infrastructure for training and deploying sequence-based foundation models for RNA and single-cell data, enabling scalable experimentation, distributed training, and production-grade inference systems. Previously at Freenome, I built distributed deep learning platforms for cancer detection using cfDNA and multi-omics data, and before that at 23andMe, I contributed to population-scale genetics systems including Recent Ancestor Location modeling and feature engineering pipelines for polygenic risk scoring used across millions of users.

Earlier in my career, I worked as a Bioinformatician at Scripps Research Institute, developing reproducible genomic pipelines, and as a Machine Learning Consultant at Juno Diagnostics, applying deep learning to prenatal genetic analysis. My Ph.D. research at UC San Diego focused on transcriptional gene regulation in C. elegans, combining experimental biology with early deep learning methods for sequence modeling and interpretability.

Outside of work, I enjoy building and shipping creative technical projects. I’ve been independently developing a VR space-combat game, Rogue Stargun, which explores real-time systems, physics simulation, and GPU-optimized rendering in Unity. I also enjoy painting, hiking, and experimenting with game development and new programming systems in my spare time.

What I'm doing

Machine Learning & Data Systems

Build systems that use data to train models and make useful predictions.
Python

High-quality development of sites, backend services, and data tools at the professional level.
Rust

Build fast and reliable software at the professional level.
DevOps & Cloud Infrastructure

I set up and manage cloud systems so applications run smoothly and can scale.

Resume

Education

UC San Diego
2010 - 2017
Ph.D. Biology • Studied transcriptional regulation in C. elegans
Cornell University
2006 - 2010
B.A. Biological Sciences (Genetics and Development), Magna Cum Laude • Hughes Scholar Program

Experience

Senior AI/ML Engineer — GSK
10/2024 — Present
Built a platform for end-to-end fine-tuning of sequence foundation models, enabling internal teams to rapidly improve models for RNA property prediction from nucleotide sequences, perturbation prediction from single-cell gene expression data (scRNA), and more.

Fine-tuned sequence foundation models for perturbation prediction from scRNA data to drive progress on understanding the mechanisms behind respiratory diseases.

Developed a RAG-based LLM pipeline used to select ~1000 candidate genes for a lung epithelial CRISPR knockout assay supporting development of next-generation respiratory drugs.

Developed novel joint embedding (DINO, JEPA) and contrastive learning approaches for large-scale (>20 million) image datasets to drive large-scale screens.

Created MCP servers that give internal AI Science-oriented LLM tools the capability to help scientists company-wide build and trigger sophisticated large-scale embedding models.
Senior Machine Learning Engineer — Freenome
09/2022 — 06/2024
Led a greenfield project building an end-to-end, scalable distributed machine learning platform using PyTorch, Ray, and Kubernetes for cancer detection from deep sequencing (methylated DNA) and protein data. The platform enabled training of much larger models with distributed data parallel (DDP) processing and sped up model training by >10x.

Deployed and managed an organization-wide MLflow-based model tracking system using Terraform, Pulumi, and Google Cloud, enabling live monitoring of deep learning model training progress, instantaneous results sharing, and fully automated, reproducible report generation, which reduced researcher manual effort by at least 5x.

Built scalable multitask learning, elastic net, and neural-network-based models in PyTorch with improved performance for classifying colorectal cancer risk from cell-free DNA data for a clinical trial cohort of >27,000 individuals.

Piloted a project to summarize biomedical literature using LLMs, first using GPT-4 and later fine-tuning an open-source LLM via DPO (direct preference optimization), demonstrating the viability of using LLMs to parse unstructured biomedical records for scaling up feature extraction.
Machine Learning Engineer & Data Scientist — 23andMe
11/2018 — 08/2022
Created and deployed into production Recent Ancestor Locations (RAL), a high-precision, high-recall country-matching algorithm that serves >15 million customers worldwide.

Improved graph-based techniques for unsupervised identification of populations by genetically based identity-by-descent (IBD) family relationships, demonstrating an effective way to segment sub-populations (graph community detection) in Mexico and the United Kingdom in a semi-unsupervised manner.

Built a large-scale feature engineering ETL pipeline for imputed SNPs (~10 million samples x ~1 million SNPs) using AWS Batch, Metaflow, AWS Glue, and AWS Athena, enabling creation of higher-quality GWAS and Polygenic Risk Score (PRS) ML models.

Developed improved models for type 2 diabetes and coronary artery disease by building and evaluating model stacking ensembles and integrating them into production PRS pipelines, improving the sensitivity and specificity of 23andMe tests for tens of thousands of customers.

Automated performance metric report generation for all polygenic risk score classifiers, leveraging MLflow artifact storage and headless Jupyter execution and reducing researcher time spent on analysis from days to minutes.
Bioinformatician — Scripps Research
05/2018 — 10/2018
Developed a classifier for organ transplant rejection using RNA data and wrote pipelines for Nanopore long-read sequencers using Common Workflow Language.
Machine Learning Consultant — Juno Diagnostics
09/2017 — 02/2018
Developed patent US20210020314A1 (Deep learning-based methods, devices, and systems for prenatal testing) along with a TensorFlow-based classifier for detecting prenatal genetic abnormalities from high-throughput sequencing data.
Data Science Fellow — Insight Data Science
01/2017 — 04/2017
Built and deployed (as a Flask app on AWS EC2) DeepPixelMonster, a TensorFlow-based GAN for creating pixel art, back when GANs were still relatively state of the art.
Ph.D. Research — Transcriptional Gene Regulation (C. elegans)
08/2010 — 05/2017
Wrote DeepNuc, a CNN model for classifying over 500,000 transcriptional start site (TSS) flanking sequences from humans, mice, fruit flies, and nematodes, as well as over 60,000 microRNA target sequences.

Researched the role of RNA expression noise during animal development by imaging single-molecule RNA expression in >5,000 embryos and analyzing the data using self-written MATLAB tools for image segmentation, fluorescence quantification, and image deconvolution.