About me

I am a Senior AI/ML Engineer and Machine Learning Platform specialist with a Ph.D. in Computational Genomics. My work focuses on building large-scale machine learning systems that bridge research and production, with applications spanning healthcare, genomics, and biological data science.

Currently, I work at GSK on end-to-end ML infrastructure for training and deploying sequence-based foundation models for RNA and single-cell data, enabling scalable experimentation, distributed training, and production-grade inference systems. Previously, at Freenome, I built distributed deep learning platforms for cancer detection using cfDNA and multi-omics data. Before that, at 23andMe, I contributed to population-scale genetics systems, including Recent Ancestor Location modeling and feature engineering pipelines for polygenic risk scoring used across millions of users.

Earlier in my career, I worked as a Bioinformatician at Scripps Research Institute, developing reproducible genomic pipelines, and as a Machine Learning Consultant at Juno Diagnostics, applying deep learning to prenatal genetic analysis. My Ph.D. research at UC San Diego focused on transcriptional gene regulation in C. elegans, combining experimental biology with early deep learning methods for sequence modeling and interpretability.

Outside of work, I enjoy building and shipping creative technical projects. I’ve been independently developing a VR space-combat game, Rogue Stargun, which explores real-time systems, physics simulation, and GPU-optimized rendering in Unity. I also enjoy painting, hiking, and experimenting with game development and new programming systems in my spare time.

What I'm doing

  • Machine Learning & Data Systems

    Build systems that use data to train models and make useful predictions.

  • Python

    Professional-grade development of sites, backend services, and data tools.

  • Rust

    Build fast and reliable software at the professional level.

  • DevOps & Cloud Infrastructure

    Set up and manage cloud systems so applications run smoothly and can scale.

Resume

Education

  1. UC San Diego

    2010 - 2017

    Ph.D., Biology • Studied transcriptional regulation in C. elegans

  2. Cornell University

    2006 - 2010

    B.A., Biological Sciences (Genetics and Development), magna cum laude • Hughes Scholar Program

Experience

  1. Senior AI/ML Engineer — GSK

    10/2024 — Present

    Built a platform for end-to-end fine-tuning of sequence foundation models, enabling internal teams to rapidly improve models for RNA property prediction from nucleotide sequences, perturbation prediction from single-cell gene expression data (scRNA), and more.

    Fine-tuned sequence foundation models for perturbation prediction from scRNA data to advance understanding of the mechanisms behind respiratory diseases.

    Developed a RAG-based LLM pipeline used to select ~1000 candidate genes for a lung epithelial CRISPR knockout assay, supporting development of next-generation respiratory drugs.

    Developed novel joint-embedding (DINO, JEPA) and contrastive learning approaches for image datasets of more than 20 million images to drive large-scale screens.

    Created MCP servers that give internal AI Science-oriented LLM tools the ability to help scientists across the company build and trigger sophisticated large-scale embedding models.

  2. Senior Machine Learning Engineer — Freenome

    09/2022 — 06/2024

    Led a greenfield project building an end-to-end, scalable distributed machine learning platform using PyTorch, Ray, and Kubernetes for cancer detection from deep sequencing (methylated DNA) and protein data. The platform enabled training of much larger models with distributed data parallel (DDP) processing, speeding up model training by >10x.

    Deployed and managed an organization-wide MLflow-based model tracking system using Terraform, Pulumi, and Google Cloud, enabling live monitoring of deep learning model training progress, instantaneous results sharing, and fully automated, reproducible report generation, which reduced researcher manual effort by at least 5x.

    Built scalable multitask learning, elastic net, and neural network models in PyTorch with improved performance for classifying colorectal cancer risk from cell-free DNA data for a clinical trial cohort of >27,000 individuals.

    Piloted a project to summarize biomedical literature using LLMs, first with GPT-4 and later by fine-tuning an open-source LLM via direct preference optimization (DPO), demonstrating the viability of using LLMs to parse unstructured biomedical records for scaling up feature extraction.

  3. Machine Learning Engineer & Data Scientist — 23andMe

    11/2018 — 08/2022

    Created and deployed into production Recent Ancestor Locations (RAL), a high-precision, high-recall country-matching algorithm that serves >15 million customers worldwide.

    Improved graph-based techniques for unsupervised identification of populations from genetic identity-by-descent (IBD) family relationships, demonstrating an effective way to segment sub-populations (graph community detection) in Mexico and the United Kingdom in a semi-unsupervised manner.

    Built a large-scale feature engineering ETL pipeline for imputed SNPs (~10 million samples × ~1 million SNPs) using AWS Batch, Metaflow, AWS Glue, and AWS Athena, enabling creation of higher-quality GWAS and Polygenic Risk Score (PRS) ML models.

    Developed improved models for type 2 diabetes and coronary artery disease by building, evaluating, and integrating model stacking ensembles into production PRS pipelines, improving the sensitivity and specificity of 23andMe tests for tens of thousands of customers.

    Automated performance metric report generation for all polygenic risk score classifiers using MLflow artifact storage and headless Jupyter execution, reducing researcher time spent on analysis from days to minutes.

  4. Bioinformatician — Scripps Research

    05/2018 — 10/2018

    Developed a classifier for organ transplant rejection using RNA data and wrote pipelines for Nanopore long-read sequencers using Common Workflow Language.

  5. Machine Learning Consultant — Juno Diagnostics

    09/2017 — 02/2018

    Developed patent US20210020314A1 (Deep learning-based methods, devices, and systems for prenatal testing), along with a TensorFlow-based classifier for detecting prenatal genetic abnormalities from high-throughput sequencing data.

  6. Data Science Fellow — Insight Data Science

    01/2017 — 04/2017

    Built and deployed (as a Flask app on AWS EC2) DeepPixelMonster, a TensorFlow-based GAN for creating pixel art, back when GANs were still relatively state of the art.

  7. Ph.D. Research — Transcriptional Gene Regulation (C. elegans)

    08/2010 — 05/2017

    Wrote DeepNuc, a CNN model for classifying over 500,000 transcriptional start site (TSS) flanking sequences from humans, mice, fruit flies, and nematodes, as well as over 60,000 microRNA target sequences.

    Researched the role of RNA expression noise during animal development by imaging single-molecule RNA expression in >5,000 embryos and analyzing the data with self-written MATLAB tools for image segmentation, fluorescence quantification, and image deconvolution.

My skills

  • Machine Learning & Data Systems
    92%
  • Python
    95%
  • Rust
    90%
  • DevOps & Cloud Infrastructure
    83%

My interests

  • Game Development
    80%
  • AI Systems
    85%
  • Technical Writing / Blogging
    90%
  • History
    83%
