Machine Learning + Biomedical Sciences

Open Networks and Big Data Lab

Problem Motivation

The explosion of single-cell sequencing and large-scale medical imaging has created a paradox: we have unprecedented data volume, but the signals are noisy, incomplete, and difficult to validate. Our lab builds robust machine learning systems that extract actionable biological insight from imperfect data, with a focus on gene regulatory networks (GRNs), transcription factor (TF) cascades, and oncology imaging.

Across these fronts, we combine Bayesian inference, contrastive representation learning, and graph machine learning to deliver calibrated predictions, interpretable structure, and practical performance on noisy, real-world data.

Results and Contributions

  1. BEACON: Bayesian Contrastive Learning for GRN Inference

    • Directional embeddings: dual heads (regulator/target) capture asymmetry of regulation.
    • Contrastive training: aligns known TF→target pairs; pushes apart likely negatives.
    • Bayesian edge scoring: Gaussian-process classifier outputs calibrated edge probabilities with quantified uncertainty.
    • Impact: Competitive or superior performance on SERGIO-simulated data and on real scRNA-seq data validated against ChIP and perturbation evidence; ablations show all three components are essential.
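
As a rough illustration of the first two ideas above (directional embeddings and contrastive training), the sketch below scores a regulator→target pair with separate regulator and target embeddings and computes an InfoNCE-style loss against sampled negatives. It is not the BEACON implementation: the gene names, the hand-set 2-D embeddings, the `temperature` value, and the function names are all assumptions made for the example, and the Gaussian-process edge scorer is not shown.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def directional_score(reg_emb, tgt_emb, regulator, target):
    """Score regulator -> target using separate regulator/target embeddings.

    Because the two roles use different embedding tables, the score of
    A -> B generally differs from the score of B -> A.
    """
    return dot(reg_emb[regulator], tgt_emb[target])


def contrastive_loss(reg_emb, tgt_emb, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one known pair against sampled negative targets.

    The loss is small when the known pair outscores every negative.
    """
    regulator, target = positive
    pos = directional_score(reg_emb, tgt_emb, regulator, target) / temperature
    logits = [pos] + [
        directional_score(reg_emb, tgt_emb, regulator, neg) / temperature
        for neg in negatives
    ]
    # Numerically stable log-sum-exp over the positive and negative logits.
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - pos


# Hypothetical embeddings: each gene has a regulator-role and a target-role vector.
reg_emb = {"TF1": [1.0, 0.0], "G1": [0.0, 0.1], "G2": [0.0, 0.1]}
tgt_emb = {"TF1": [0.0, 0.1], "G1": [1.0, 0.0], "G2": [-1.0, 0.0]}

forward = directional_score(reg_emb, tgt_emb, "TF1", "G1")
reverse = directional_score(reg_emb, tgt_emb, "G1", "TF1")
loss = contrastive_loss(reg_emb, tgt_emb, ("TF1", "G1"), negatives=["G2"])
```

With these toy vectors the TF1→G1 score is much larger than the G1→TF1 score, which is the asymmetry the dual heads are meant to capture. In a trained model the embeddings would be learned by minimizing this kind of loss rather than set by hand.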

  2. What Imputation Really Does: Limits of Zero-Imputation in scRNA-seq

    • Finding: Different zero-imputation pipelines applied to the same datasets yield drastically different inferred GRNs, with very low Jaccard overlap between them.
    • Simulation evidence: Under known ground truth, common imputers fail to recover the true network reliably.
    • Takeaway: Avoid fragile preprocessing; prefer uncertainty-aware, non-imputative models (e.g., BEACON) that learn directly from noisy counts.
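
For readers unfamiliar with the overlap measure mentioned above, here is a minimal sketch of a Jaccard comparison between two inferred networks, treating each GRN as a set of directed (regulator, target) edges. The edge lists and variable names are invented for illustration and do not come from the paper's experiments.

```python
def jaccard(edges_a, edges_b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two directed edge sets."""
    a, b = set(edges_a), set(edges_b)
    union = a | b
    if not union:
        return 1.0  # Convention: two empty networks are treated as identical.
    return len(a & b) / len(union)


# Hypothetical top edges inferred after two different imputation pipelines.
grn_imputer_a = {("TF1", "G1"), ("TF1", "G2"), ("TF2", "G3")}
grn_imputer_b = {("TF1", "G1"), ("TF3", "G2"), ("TF2", "G4")}

overlap = jaccard(grn_imputer_a, grn_imputer_b)  # 1 shared edge out of 5 distinct
```

Edges are compared as ordered pairs, so TF1→G1 and G1→TF1 count as different edges. A value near 0 means the two pipelines lead to almost entirely different networks.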

  3. TF Cascades: A Graph ML Compendium

    • Resource: Knowledge graph of tens of thousands of TF→TF paths assembled from curated sources.
    • Analytics: Centrality and enrichment highlight high-influence TFs (e.g., cancer-linked) and candidate therapeutic pathways.
    • Utility: A reusable backbone for hypothesis generation, drug target nomination, and integration with GRN inference outputs.
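
To make the analytics step above more concrete, the sketch below ranks TFs in a small directed TF→TF graph by how many other TFs they can reach downstream. Downstream reach is used here only as a simple stand-in for influence; it is an assumption of this example, not necessarily one of the centrality measures used in the compendium, and the TF names and edges are hypothetical.

```python
from collections import deque


def build_graph(edges):
    """Build an adjacency map from directed (source TF, target TF) edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())  # Make sure sink TFs appear as nodes too.
    return graph


def downstream_reach(graph, source):
    """Count the TFs reachable from `source` along directed paths (BFS)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) - 1  # Exclude the source itself.


# Hypothetical cascade: TF_A sits at the top, TF_E at the bottom.
edges = [
    ("TF_A", "TF_B"),
    ("TF_A", "TF_C"),
    ("TF_B", "TF_D"),
    ("TF_C", "TF_D"),
    ("TF_D", "TF_E"),
]
graph = build_graph(edges)
reach = {tf: downstream_reach(graph, tf) for tf in graph}
# Rank by reach (descending), breaking ties alphabetically.
ranking = sorted(reach, key=lambda tf: (-reach[tf], tf))
```

On a graph with tens of thousands of paths, a graph library with established centrality routines would be a more practical choice than this hand-rolled traversal.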

  4. LatentCADx: Precision Mammography from Imprecise Annotations

    • Approach: Joint classification + segmentation with a strict containment penalty to respect coarse radiology marks while learning sharper lesion masks.
    • Results: High ROC AUC and average precision (AP) for case classification and strong segmentation precision; markedly fewer “confused” pixels and crisper boundaries.
    • Clinical value: Reliable lesion localization even when pixel-perfect labels are unavailable, which matches the reality of hospital data.
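
One way to picture the containment idea described above is as a penalty on predicted lesion probability that falls outside the coarse radiology mark. The sketch below is a simplified reading of that idea, not the LatentCADx loss itself: the function name, the averaging over all pixels, and the 2×2 toy masks are assumptions made for the example.

```python
def containment_penalty(pred_mask, coarse_mask):
    """Mean predicted lesion probability lying outside the coarse annotation.

    pred_mask:   2-D list of probabilities in [0, 1] from the segmentation head.
    coarse_mask: 2-D list of 0/1 flags, 1 where the radiologist's mark covers.
    """
    outside_mass = 0.0
    n_pixels = 0
    for pred_row, coarse_row in zip(pred_mask, coarse_mask):
        for prob, inside in zip(pred_row, coarse_row):
            if not inside:
                outside_mass += prob  # Only predictions outside the mark are penalized.
            n_pixels += 1
    return outside_mass / n_pixels if n_pixels else 0.0


# Hypothetical 2x2 image: the coarse mark covers only the top-left pixel.
coarse = [[1, 0],
          [0, 0]]
leaky_pred = [[0.9, 0.2],
              [0.1, 0.0]]
contained_pred = [[0.9, 0.0],
                  [0.0, 0.0]]

leaky_penalty = containment_penalty(leaky_pred, coarse)
contained_penalty = containment_penalty(contained_pred, coarse)
```

A prediction that stays inside the mark incurs no penalty, so the model remains free to shrink the mask to a sharper lesion outline within the annotated region. In training, a term like this would be added to the classification and segmentation losses with some weight.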

Members

  1. Yunwei Zhao, NYU.
  2. Ankit Bhardwaj, NYU.

Publications

  1. Ulzee An, Ankit Bhardwaj, Khader Shameer, Lakshminarayanan Subramanian. "High Precision Mammography Lesion Identification from Imprecise Medical Annotations (LatentCADx)." Frontiers in Big Data, 2021. (PDF)
  2. Sonish Sivarajkumar, Pratyush Tandale, Ankit Bhardwaj, Kipp W. Johnson, Anoop Titus, Benjamin S. Glicksberg, Shameer Khader, Kamlesh K. Yadav, Lakshminarayanan Subramanian. "Generation of a Compendium of Transcription Factor Cascades and Identification of Potential Therapeutic Targets using Graph Machine Learning." Genes, 2025. (PDF)
  3. Yunwei Zhao, Ankit Bhardwaj, Lakshminarayanan Subramanian. "BEACON: Bayesian Contrastive Learning for Single-Cell Gene Regulatory Inference." NeurIPS Workshop on AI Virtual Cells and Instruments: A New Era in Drug Discovery and Development (AI4D3), 2025. (Open Review)
  4. Ankit Bhardwaj, Joshua Weiner, Preetha Balasubramanian, Lakshminarayanan Subramanian. "Limitations of scRNA-seq Zero-Imputation Methods for Network Inference." ICML Workshop on Machine Learning for Life and Material Science: From Theory to Industry Applications (ML4LMS), 2024. (Open Review)