Machine Learning + Biomedical Sciences

Open Networks and Big Data Lab

Research Focus

Developing robust machine learning methods to extract meaningful biological insights from noisy, sparse, and high-dimensional biomedical data.

Our work spans two primary areas. In single-cell genomics, we tackle extreme sparsity (over 90% zero values), technical noise from dropout events, and the difficulty of distinguishing true biological signal from technical artifacts in scRNA-seq data. In clinical imaging, we learn from imprecise medical annotations to achieve high-precision lesion detection and segmentation. Both directions combine statistical rigor with domain-specific interpretability to advance biomedical discovery and clinical applications.

Gene Regulatory Networks

Inferring causal regulatory relationships between transcription factors and target genes from noisy single-cell data

Zero Imputation

Distinguishing biological zeros from technical dropouts while preserving genuine expression patterns

Clinical Imaging

Learning from imprecise medical annotations to identify lesions and boundaries with high precision

Key Methods & Contributions

BEACON: Bayesian Contrastive Learning for Gene Regulatory Network Inference

NeurIPS Workshop AI4D3 2025

BEACON infers gene regulatory networks from static scRNA-seq data, addressing sparsity and noise by combining three components. First, asymmetric dual embeddings use separate source and target embeddings to capture the directional nature of regulatory relationships (TF → gene), enabling the model to distinguish regulators from targets. Second, soft nearest neighbor contrastive learning brings embeddings of known regulatory pairs closer while pushing non-regulatory pairs apart, learning robust representations despite noise. Finally, Gaussian Process Bayesian scoring provides calibrated probability scores with uncertainty quantification, which is crucial for prioritizing experimental validation.
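To make the first two components concrete, here is a minimal NumPy sketch. It is not the BEACON implementation. It assumes, for illustration only, that every gene has one source and one target embedding, that a directed pair is represented by concatenating the regulator's source embedding with the target's target embedding, and that the contrastive objective is the standard soft nearest neighbor loss over those pair representations. The function names and the temperature value are hypothetical.

```python
import numpy as np

def pair_features(src_emb, tgt_emb, pairs):
    """Represent each directed pair (regulator, target) by concatenating the
    regulator's source embedding with the target's target embedding.
    Assumption: this pairing scheme is illustrative, not taken from the paper."""
    pairs = np.asarray(pairs)
    return np.concatenate([src_emb[pairs[:, 0]], tgt_emb[pairs[:, 1]]], axis=1)

def soft_nearest_neighbor_loss(x, labels, temperature=1.0):
    """Soft nearest neighbor loss: low when each point's close neighbors
    share its label (regulatory vs. non-regulatory)."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances between pair representations.
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.sum(diff ** 2, axis=-1)
    sim = np.exp(-d2 / temperature)
    np.fill_diagonal(sim, 0.0)  # a point is not its own neighbor
    same = labels[:, None] == labels[None, :]
    numerator = np.sum(sim * same, axis=1)   # mass on same-label neighbors
    denominator = np.sum(sim, axis=1)        # mass on all neighbors
    eps = 1e-12  # guards against log(0) when a point has no close neighbor
    return float(-np.mean(np.log((numerator + eps) / (denominator + eps))))
```

Because the source and target embeddings are separate tables, the representation of the pair (a, b) generally differs from that of (b, a), which is what lets a downstream scorer treat TF → gene and gene → TF differently.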

The method achieves 97.24% AUROC on SERGIO-simulated E. coli networks and maintains competitive performance across BEELINE benchmarks with ChIP-seq and perturbation validation. Ablation studies confirm all three components are essential for performance.
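For readers unfamiliar with the metric, AUROC for edge prediction can be read as the probability that a randomly chosen true regulatory edge receives a higher score than a randomly chosen non-edge. The sketch below computes it by direct pairwise comparison. It is an illustrative helper, not the evaluation code behind the reported numbers, and the function name is hypothetical.

```python
import numpy as np

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Quadratic in the number of edges, so intended
    for small examples only."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative edge")
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (pos.size * neg.size))
```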

Transcription Factor Cascade Compendium

Genes 2025

A comprehensive knowledge graph of transcription factor regulatory cascades assembled from multiple curated sources, enabling systematic analysis of regulatory pathways. The compendium integrates tens of thousands of TF→TF regulatory relationships from literature and databases. Graph machine learning techniques identify high-influence transcription factors linked to disease pathways, providing a framework for drug target nomination and hypothesis generation.
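As a rough illustration of what "influence" in a cascade graph can mean, the sketch below counts how many transcription factors each TF can reach through directed TF→TF edges. This is a deliberately simple proxy, not the graph machine learning used for the compendium, and the TF names in the usage note are placeholders rather than real entries.

```python
from collections import deque

def downstream_reach(edges):
    """For each node in a directed TF->TF edge list, count the number of
    other nodes reachable by following regulatory edges (breadth-first)."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())
    reach = {}
    for start in graph:
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        reach[start] = len(seen) - 1  # exclude the starting TF itself
    return reach
```

For the toy edge list `[("TF_A", "TF_B"), ("TF_B", "TF_C"), ("TF_B", "TF_D"), ("TF_E", "TF_D")]`, `TF_A` sits at the top of the longest cascade and reaches three downstream factors, while `TF_C` and `TF_D` reach none.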

Consensus-Based Zero Imputation for scRNA-seq

Ongoing

This work addresses zero-inflation in single-cell data through a conservative ensemble approach that preserves biological zeros while recovering technical dropouts. The method relies on multi-method consensus: a zero is imputed only when multiple independent methods agree, which reduces false-positive imputations that can distort biological signals. A marker-gene refinement step uses known cell-type markers to guide imputation, iteratively improving confidence thresholds while maintaining conservative standards.
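The consensus rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the lab's pipeline: a method "votes" for an entry when it proposes a value above a threshold, the filled value is the mean of the voting methods' proposals, and the marker-gene refinement loop is omitted. The function name and the parameters `min_agree` and `eps` are hypothetical.

```python
import numpy as np

def consensus_impute(counts, imputed_list, min_agree=2, eps=0.0):
    """Fill an observed zero only when at least `min_agree` imputation
    methods propose a value above `eps`; observed non-zeros are untouched."""
    counts = np.asarray(counts, dtype=float)
    stack = np.stack([np.asarray(m, dtype=float) for m in imputed_list])
    proposed = stack > eps                 # which methods vote to impute
    votes = proposed.sum(axis=0)
    # Mean of the proposals from voting methods (assumed aggregation rule).
    mean_proposed = (stack * proposed).sum(axis=0) / np.maximum(votes, 1)
    fill = (counts == 0) & (votes >= min_agree)
    return np.where(fill, mean_proposed, counts), fill
```

Raising `min_agree` makes the rule more conservative: with three methods and `min_agree=3`, an entry stays zero unless every method proposes a non-zero value.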

Empirical validation across nine Crohn's disease datasets reveals substantial disagreement among existing methods, with concordance often below 70%. With the consensus approach, 91.4% of tested marker genes show improved F1 scores. The iterative algorithm converges in 4 iterations, increasing matrix density from 9% to 28% while preserving cell-type identities.
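The two quantities reported here can be illustrated as follows. Matrix density is simply the fraction of non-zero entries. For concordance, the sketch assumes one plausible definition: the fraction of originally zero entries on which two methods make the same impute-or-keep decision. The exact definition used in the study may differ, and both function names are hypothetical.

```python
import numpy as np

def density(matrix):
    """Fraction of entries in the expression matrix that are non-zero."""
    matrix = np.asarray(matrix)
    return np.count_nonzero(matrix) / matrix.size

def zero_concordance(counts, imputed_a, imputed_b):
    """Among entries that were zero in the raw counts, the fraction where
    two methods agree on whether to impute (assumed definition)."""
    zeros = np.asarray(counts) == 0
    a = np.asarray(imputed_a)[zeros] > 0
    b = np.asarray(imputed_b)[zeros] > 0
    if a.size == 0:
        raise ValueError("raw matrix has no zero entries to compare")
    return float(np.mean(a == b))
```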

Contra: Contrarian Statistics for Controlled Variable Selection

AISTATS 2021

Contra uses a mixture of two "contrarian" models (true vs. null-swapped) to obtain stronger FDR control when covariate models are misspecified. It maintains asymptotic power 1, produces more reliable p-values than calibrated holdout randomization tests (HRTs), and scales to high dimensions and large n. The method is demonstrated on synthetic and genetic datasets with improved rigor and efficiency.
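To show how per-feature p-values translate into a selected set with FDR control, the sketch below applies the standard Benjamini-Hochberg step-up procedure. This is not the Contra statistic itself, and the summary above does not say which selection procedure is paired with Contra's p-values; Benjamini-Hochberg is assumed here purely for illustration, as are the function name and the default `alpha`.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Return a boolean mask of features selected at target FDR `alpha`
    using the Benjamini-Hochberg step-up rule."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    selected = np.zeros(m, dtype=bool)
    if passed.any():
        # Select every feature up to the largest rank that passes.
        k = np.max(np.nonzero(passed)[0])
        selected[order[:k + 1]] = True
    return selected
```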

LatentCADx: Precision Medical Imaging with Imprecise Annotations

Frontiers in Big Data 2021

Joint classification and segmentation framework that learns from coarse clinical annotations to identify lesions with high precision, addressing the reality of imperfect medical labeling.

Publications

  1. "BEACON: Bayesian Contrastive Learning for Single-Cell Gene Regulatory Inference."
    Yunwei Zhao, Ankit Bhardwaj, Lakshminarayanan Subramanian.
    NeurIPS Workshop on AI Virtual Cells and Instruments: A New Era in Drug Discovery and Development (AI4D3), 2025.
    OpenReview
  2. "Generation of a Compendium of Transcription Factor Cascades and Identification of Potential Therapeutic Targets using Graph Machine Learning."
    Sonish Sivarajkumar, Pratyush Tandale, Ankit Bhardwaj, Kipp W. Johnson, Anoop Titus, Benjamin S. Glicksberg, Shameer Khader, Kamlesh K. Yadav, Lakshminarayanan Subramanian.
    Genes, 2025.
    arXiv
  3. "Limitations of scRNA-seq Zero-Imputation Methods for Network Inference."
    Ankit Bhardwaj, Joshua Weiner, Preetha Balasubramanian, Lakshminarayanan Subramanian.
    ICML Workshop on Machine Learning for Life and Material Science: From Theory to Industry Applications (ML4LMS), 2024.
    OpenReview
  4. "Contra: Contrarian Statistics for Controlled Variable Selection."
    Mukund Sudarshan, Aahlad Puli, Lakshminarayanan Subramanian, Sriram Sankararaman, Rajesh Ranganath.
    AISTATS, 2021.
    Proceedings
  5. "High Precision Mammography Lesion Identification from Imprecise Medical Annotations (LatentCADx)."
    Ulzee An, Ankit Bhardwaj, Khader Shameer, Lakshminarayanan Subramanian.
    Frontiers in Big Data, 2021.
    PMC
  6. "Fast Kernel-based Association Testing of non-linear genetic effects for Biobank-scale data."
    Boyang Fu, Ali Pazokitoroudi, Mukund Sudarshan, Lakshminarayanan Subramanian, Sriram Sankararaman.
    Nature Communications, 2023.
    Link
  7. "Deep Significance Clustering: a novel approach for identifying risk-stratified and predictive patient subgroups."
    Yufang Huang, Yifan Liu, Peter A D Steel, Kelly M Axsom, John R Lee, Sri Lekha Tummalapalli, Fei Wang, Jyotishman Pathak, Lakshminarayanan Subramanian, Yiye Zhang.
    Journal of the American Medical Informatics Association (JAMIA), 2021.
    Full Text
  8. "The Importance of Long-Term Care Populations in models of COVID-19."
    Karl Pillemer, Lakshminarayanan Subramanian, Nathaniel Hupert.
    Journal of the American Medical Association (JAMA), 2020.
    Link
  9. "Sepsis in the era of data-driven medicine: personalizing risks, diagnoses, treatments and prognoses."
    Andrew C Liu, Krishna Patel, Ramya Dhatri Vunikili, Kipp W Johnson, Fahad Abdu, Shivani Kamath Belman, Benjamin S Glicksberg, Pratyush Tandale, Roberto Fontanez, Oommen K Mathew, Andrew Kasarskis, Priyabrata Mukherjee, Lakshminarayanan Subramanian, Joel T Dudley, Khader Shameer.
    Briefings in Bioinformatics, 2020.
    Link
  10. "Quantifying the impact of dengue containment activities using high-resolution observational data."
    Nabeel Abdur Rehman, Henrik Salje, Moritz U G Kraemer, Lakshminarayanan Subramanian, Simon Cauchemez, Umar Saif, Rumi Chunara.
    PLOS Neglected Tropical Diseases, 2020.
    Link
  11. "Fine-Grained Dengue Forecasting using Telephone Triage Services."
    Nabeel Abdur Rehman, Shankar Kalyanaraman, Talal Ahmad, Fahad Pervaiz, Umar Saif, Lakshminarayanan Subramanian.
    Science Advances, 2016.
    PDF

Research Team

Current Members
  • Yunwei Zhao - PhD Student, Courant Institute, NYU
  • Ankit Bhardwaj - PhD Student, Courant Institute, NYU
  • Lakshminarayanan Subramanian - Professor, Courant Institute, NYU
Collaborators
  • Rajesh Ranganath - NYU
  • Sriram Sankararaman - UCLA
  • Yiye Zhang - Weill Cornell Medicine
  • Khader Shameer - Sanofi
  • Rumi Chunara - NYU

Software & Resources