.. _index:

======================================================================
Welcome to scDoRI: Single-cell Deep Multi-Omic Regulatory Inference
======================================================================

.. image:: _static/scdori_schematic_main.png
   :align: center
   :width: 80%
   :alt: scDoRI Schematic

**scDoRI** jointly models paired single-cell RNA-seq and ATAC-seq (multiome) data to infer
**enhancer-mediated gene regulatory networks (eGRNs)**. It couples an
**encoder-decoder** neural architecture with mechanistic constraints
(enhancer-gene links, TF activators and repressors) to learn
**topics**: sets of co-accessible peaks, co-expressed genes, TF regulators and their enhancer-mediated downstream targets.
Because it trains in mini-batches, scDoRI scales to large datasets while capturing
continuous, cell-specific changes in gene regulation.

Key Highlights
--------------

- **Unified approach**: a single model for dimensionality reduction and eGRN inference
- **Biological insights**: identifies lower-dimensional topics, candidate enhancer-gene links, co-regulated gene programs and per-topic TF-gene networks
- **Continuous eGRN modelling without predefined clusters**: each cell is a mixture of regulatory topics, so fine-grained changes in regulatory programs can be assessed
- **Scalable**: mini-batch training for large single-cell multiome datasets
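
The "mixture of topics" idea can be illustrated with a minimal sketch. It assumes that a cell's topic weights are non-negative and sum to one, obtained here with a softmax over per-topic scores; the function name and the example scores are hypothetical and are not taken from the scDoRI code.

.. code-block:: python

   import math

   def topic_proportions(topic_scores):
       """Turn unnormalised per-topic scores for one cell into mixture weights (softmax)."""
       shift = max(topic_scores)  # subtract the maximum for numerical stability
       exps = [math.exp(s - shift) for s in topic_scores]
       total = sum(exps)
       return [e / total for e in exps]

   # Hypothetical scores for one cell over three topics (illustrative values only).
   cell_scores = [2.0, 0.5, -1.0]
   weights = topic_proportions(cell_scores)
   print([round(w, 3) for w in weights])

Under this assumption, a cell is described by a vector of graded topic weights rather than a single cluster label, which is what allows regulatory programs to vary continuously between cells.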


Input Requirements
------------------

scDoRI expects **single-cell multiome data** with the following inputs:

- **RNA**: an AnnData object (``.h5ad``) with a **cells by genes** matrix of raw expression counts
- **ATAC**: an AnnData object (``.h5ad``) with a **cells by peaks** matrix of raw Tn5 insertion counts

  - Peaks must include genomic coordinates in ``.var`` with columns ``chr``, ``start`` and ``end``

The two datasets must be paired, i.e. the RNA and ATAC profiles must come from the **same cells**.
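
A quick sanity check of these requirements might look like the sketch below. The helper names are hypothetical and not part of scDoRI, and the sketch assumes that pairing shows up as identical cell barcodes in both objects. The helpers take plain sequences so they run without any dependencies.

.. code-block:: python

   REQUIRED_PEAK_COLUMNS = ("chr", "start", "end")

   def missing_peak_columns(var_columns):
       """Return the required coordinate columns that are absent from the ATAC ``.var``."""
       present = set(var_columns)
       return [c for c in REQUIRED_PEAK_COLUMNS if c not in present]

   def unpaired_barcodes(rna_barcodes, atac_barcodes):
       """Return barcodes seen in only one modality; two empty sets mean fully paired."""
       rna, atac = set(rna_barcodes), set(atac_barcodes)
       return rna - atac, atac - rna

   # Toy stand-ins for real data (illustrative values only).
   print(missing_peak_columns(["chr", "start"]))
   print(unpaired_barcodes(["c1", "c2"], ["c2", "c3"]))

With real data, one would pass ``atac.var.columns``, ``rna.obs_names`` and ``atac.obs_names`` from AnnData objects loaded with ``anndata.read_h5ad``. Whether scDoRI also expects the cells in the same order is not stated here, so check the preprocessing notebook.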

The example notebooks provided in this repository are built using the **mouse gastrulation dataset** from:

- Argelaguet et al., bioRxiv 2022: https://www.biorxiv.org/content/10.1101/2022.06.15.496239v1
- Dataset download link: https://www.dropbox.com/scl/fo/9inmw43pz2bygtqepxl82/ALeeNjuEqw4qp0L9Z9t71xo/data/processed?rlkey=5ihgkvafegkke9jnldlnhw1x6&subfolder_nav_tracking=1&st=cixvwynt&dl=0


Model Architecture and Training
-------------------------------

See the :doc:`method_overview` page for a description of the model's core features, including the encoder-decoder design, reconstruction tasks and training scheme.


Project Layout
--------------

- **preprocessing_pipeline/**
  Scripts and a ``config.py`` for data filtering and selection of highly variable peaks, genes and TFs. Also computes the in-silico ChIP-seq matrix.

- **scdori/**
  Core scDoRI model code and a separate ``config.py`` for hyperparameters
  (number of topics, learning rate, sparsity, etc.).

- **notebooks/**

  - ``preprocessing.ipynb``: load and filter multiome data, compute the in-silico ChIP-seq matrix and run the other preprocessing steps.
  - ``training.ipynb``: train the scDoRI autoencoder in mini-batches and produce eGRN outputs.

- **environment.yml**
  Conda environment specifying dependencies (scanpy, PyTorch, etc.).

- **cisbp_motif_file**
  Example motif database for mouse/human. If you use a custom motif file,
  set its path in the config.

Installation and Usage
----------------------

1. **Clone** the repository and create the environment:

   .. code-block:: bash

      git clone https://github.com/saraswatmanu/scDoRI.git
      cd scDoRI
      conda env create -f environment.yml
      conda activate scdori_env

2. **Edit** the config files:

   - ``preprocessing_pipeline/config.py``: set the paths to the RNA and ATAC AnnData ``.h5ad`` files and the motif file, and the number of peaks, genes and TFs to train on.
   - ``scdori/config.py``: set the scDoRI hyperparameters (number of topics, learning rate, epochs, etc.).

3. **Run** the notebooks in order:

   - ``notebooks/preprocessing.ipynb``
   - ``notebooks/training.ipynb``

.. caution::
   If using a mouse dataset, set ``species = "mouse"`` in the config.
   For human data, change the setting accordingly and update your motif file path (cisbp or custom).
   A custom motif MEME file must follow the same schema as the example cisbp file provided.

Tutorial Notebooks
------------------

.. grid:: 2
   :gutter: 2

   .. card:: Preprocessing (Notebook 1)
      :link: notebooks/preprocessing
      :link-type: doc

      - **Filter** to highly variable genes/peaks/TFs
      - **Compute** the in-silico ChIP-seq matrix from your motif database, and peak-gene distances
      - **Output** processed data, in-silico ChIP-seq matrix, peak-gene distances

   .. card:: Training (Notebook 2)
      :link: notebooks/training
      :link-type: doc
    
      - **Train** model with mini-batches
      - **Infer** topics and TF–gene networks
      - **Downstream analysis** using inferred eGRNs and topic activities
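
As a rough illustration of the peak-gene distance step in the preprocessing notebook, the sketch below measures the distance from a peak's midpoint to a gene's transcription start site (TSS). The function names, the midpoint-to-TSS definition and the 150 kb window are assumptions made for this example. The preprocessing notebook and ``preprocessing_pipeline/config.py`` define what scDoRI actually does.

.. code-block:: python

   def peak_tss_distance(peak_start, peak_end, tss):
       """Absolute distance in bp from the peak midpoint to the gene TSS."""
       midpoint = (peak_start + peak_end) // 2
       return abs(midpoint - tss)

   def peaks_within(peaks, gene_chr, tss, max_distance):
       """Return (peak, distance) pairs for same-chromosome peaks within max_distance bp."""
       hits = []
       for chrom, start, end in peaks:
           if chrom != gene_chr:
               continue  # peaks on other chromosomes are never linked to this gene
           distance = peak_tss_distance(start, end, tss)
           if distance <= max_distance:
               hits.append(((chrom, start, end), distance))
       return hits

   # Toy peaks as (chr, start, end) tuples and a hypothetical 150 kb window.
   peaks = [("chr1", 1000, 1500), ("chr1", 500000, 500400), ("chr2", 1200, 1700)]
   print(peaks_within(peaks, "chr1", tss=2000, max_distance=150_000))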


Hyperparameter and Feature Selection Guide
------------------------------------------

See the :doc:`training_guide` page for guidance on choosing the number of features (peaks, genes, TFs) and hyperparameters (number of topics, regularisation, etc.).

API Reference
-------------

See the :doc:`api_reference` page for documentation on:

- **preprocessing_pipeline** scripts
- **scdori** model scripts

These detail function usage, parameters, and advanced features.

License & Citation
------------------

This project is released under the MIT License. If scDoRI aids your research, please cite our
upcoming publication. For questions, open a GitHub issue or email the maintainers.

.. toctree::
   :maxdepth: 2
   :hidden:

   method_overview
   training_guide
   api_reference