scDoRI training#

scdori.train_grn.compute_eval_loss_grn(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute the validation (evaluation) loss for the GRN phase.

This function evaluates loss components for ATAC, TF, RNA, and RNA-from-GRN on a validation dataset.

Parameters#

modeltorch.nn.Module

The scDoRI model.

devicetorch.device

The device (CPU or CUDA) used for PyTorch tensors.

train_loaderDataLoader

DataLoader for the training set (used to compute TF expression).

eval_loaderDataLoader

DataLoader for the validation set.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells constituting each input metacell; set to 1 for single-cell data.

tf_indiceslist of int

Indices of TF features in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding for batch information.

config_filepython file

Configuration file for model training.

Returns#

tuple of float

A tuple containing: (eval_loss, eval_loss_atac, eval_loss_tf, eval_loss_rna, eval_loss_rna_grn).

scdori.train_grn.get_tf_expression(tf_expression_mode, model, device, train_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute TF expression per topic.

If tf_expression_mode is “True”, this function computes the mean TF expression for the top-k cells in each topic. Otherwise, it uses a normalized topic-TF decoder matrix from the model.

Parameters#

tf_expression_modestr

Mode for TF expression. “True” calculates per-topic TF expression from top-k cells, “latent” uses the topic-TF decoder matrix.

modeltorch.nn.Module

The scDoRI model containing encoder and decoder modules.

devicetorch.device

The device (CPU or CUDA) used for PyTorch tensors.

train_loaderDataLoader

DataLoader for training data.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells constituting each input metacell; set to 1 for single-cell data.

tf_indiceslist of int

Indices of TF features in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding for batch information.

config_filepython file

Configuration object for model training.

Returns#

torch.Tensor

A (num_topics x num_tfs) tensor of TF expression values for each topic.
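The “True” mode described above can be pictured with a small NumPy sketch. This is not the scDoRI implementation: the function name `topic_tf_from_top_cells` and its arguments are hypothetical, and it assumes that "top-k cells" means the k cells with the largest topic weight.

```python
import numpy as np

def topic_tf_from_top_cells(theta, tf_expr, k):
    """Mean TF expression over the k cells with the highest weight in each topic.

    theta   : (num_cells, num_topics) topic weights per cell
    tf_expr : (num_cells, num_tfs) TF expression per cell
    returns : (num_topics, num_tfs)
    """
    num_topics = theta.shape[1]
    out = np.zeros((num_topics, tf_expr.shape[1]))
    for t in range(num_topics):
        top_cells = np.argsort(theta[:, t])[-k:]   # indices of the k largest weights
        out[t] = tf_expr[top_cells].mean(axis=0)
    return out

# Toy data: cells 0-1 belong mostly to topic 0, cells 2-3 to topic 1.
theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]])
tf_expr = np.array([[4.0, 0.0], [2.0, 0.0], [0.0, 6.0], [0.0, 2.0]])
topic_tf = topic_tf_from_top_cells(theta, tf_expr, k=2)
```

In the library, the number of cells per topic would presumably come from the `cells_per_topic` configuration attribute rather than a literal `k`.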

scdori.train_grn.set_encoder_frozen(model, freeze=True)[source]#

Freeze or unfreeze the encoder parameters.

Parameters#

modeltorch.nn.Module

scDoRI model containing the encoder modules.

freezebool, optional

If True, freeze the encoder parameters; if False, unfreeze them. Default is True.

scdori.train_grn.set_peak_gene_frozen(model, freeze=True)[source]#

Freeze or unfreeze the peak-gene link parameters.

Parameters#

modeltorch.nn.Module

scDoRI model containing the peak-gene factor.

freezebool, optional

If True, freeze the peak-gene parameters; if False, unfreeze them. Default is True.

scdori.train_grn.set_topic_peak_frozen(model, freeze=True)[source]#

Freeze or unfreeze the topic-peak decoder parameters.

Parameters#

modeltorch.nn.Module

scDoRI model containing the topic-peak decoder.

freezebool, optional

If True, freeze the topic-peak decoder; if False, unfreeze it. Default is True.

scdori.train_grn.set_topic_tf_frozen(model, freeze=True)[source]#

Freeze or unfreeze the topic-TF decoder parameters.

Parameters#

modeltorch.nn.Module

scDoRI model containing the topic-TF decoder.

freezebool, optional

If True, freeze the topic-TF decoder; if False, unfreeze it. Default is True.
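The four `set_*_frozen` helpers above all toggle gradient tracking on a subset of parameters. A minimal PyTorch sketch of that pattern follows; `TinyModel` and `set_frozen` are hypothetical stand-ins that only borrow two attribute names from the scDoRI class, and the library's own helpers may differ in detail.

```python
import torch
from torch import nn

class TinyModel(nn.Module):
    """Hypothetical stand-in exposing two attribute names from the scDoRI class."""
    def __init__(self):
        super().__init__()
        self.encoder_rna = nn.Sequential(nn.Linear(8, 4), nn.ReLU())
        self.topic_tf_decoder = nn.Parameter(torch.randn(3, 5))

def set_frozen(params, freeze=True):
    """Disable (freeze=True) or enable (freeze=False) gradient tracking."""
    for p in params:
        p.requires_grad = not freeze

model = TinyModel()
set_frozen(model.encoder_rna.parameters(), freeze=True)   # a module: iterate its parameters
set_frozen([model.topic_tf_decoder], freeze=False)        # a bare Parameter: wrap it in a list

# Names of the parameters that would still receive gradient updates.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

When building an optimizer after freezing, one would typically pass only the parameters that still require gradients, for example `torch.optim.Adam(p for p in model.parameters() if p.requires_grad)`.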

scdori.train_grn.train_model_grn(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Train the model in Phase 2 (GRN phase).

In this phase, the model focuses on learning activator and repressor TF-gene links per topic (module 4 of scDoRI). Other modules of the model can be optionally frozen or unfrozen based on the configuration.

Parameters#

modeltorch.nn.Module

The scDoRI model to train.

devicetorch.device

The device (CPU or CUDA) used for PyTorch tensors.

train_loaderDataLoader

DataLoader for the training set.

eval_loaderDataLoader

DataLoader for the validation set, used to check early stopping criteria.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells constituting each input metacell; set to 1 for single-cell data.

tf_indiceslist of int

Indices of TF features in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding for batch information.

config_filepython file

Configuration file for model training.

Returns#

torch.nn.Module

The trained model after the GRN phase completes or early stopping occurs.
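Early stopping on the validation set can be sketched as below. The function `run_with_early_stopping` and the simulated loss curve are hypothetical; the sketch assumes that patience is counted in epochs since the last improvement and that validation runs every `eval_frequency` epochs, which may not match the library's exact bookkeeping.

```python
def run_with_early_stopping(val_loss_at, max_epochs, eval_frequency, patience):
    """Return (best_loss, stop_epoch) for a simulated training run.

    val_loss_at : callable epoch -> validation loss (stands in for a real evaluation)
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch in range(max_epochs):
        if epoch % eval_frequency != 0:
            continue                               # evaluate only every eval_frequency epochs
        loss = val_loss_at(epoch)
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch    # a real loop would also save the weights here
        elif epoch - best_epoch >= patience:
            return best_loss, epoch                # no improvement for `patience` epochs
    return best_loss, max_epochs - 1

# Simulated validation loss: improves until epoch 10, then gets worse.
simulated = lambda epoch: abs(epoch - 10) + 1.0
best, stopped_at = run_with_early_stopping(
    simulated, max_epochs=100, eval_frequency=2, patience=6
)
```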

Global configuration for the scDoRI modeling pipeline.

This file defines top-level constants and parameters controlling:

  1. Logging

  2. File paths for data and outputs

  3. Model architecture details (numbers of topics, hidden dimensions)

  4. Training phases and hyperparameters

  5. Loss weighting for different data modalities (ATAC, TF, RNA)

  6. Regularization and early stopping settings

  7. Significance testing cutoffs for TF-gene links

  8. UMAP parameters for visualization

Attributes#

logging_levelint

The Python logging level (e.g., logging.INFO).

data_dirPath

Base directory containing the processed anndata and other precomputed files.

output_subdirstr

Subdirectory name within data_dir for storing or accessing outputs.

rna_metacell_filestr

Name of the H5AD file containing RNA data.

atac_metacell_filestr

Name of the H5AD file containing ATAC data.

batch_colstr

Key for the batch column in the AnnData object.

gene_peak_distance_filestr

Filename for the NumPy array containing gene-peak distances.

insilico_chipseq_act_filestr

Filename for the in silico ChIP-seq activator embeddings.

insilico_chipseq_rep_filestr

Filename for the in silico ChIP-seq repressor embeddings.

random_seedint

Random seed for reproducibility.

batch_size_cellint

Batch size for training.

dim_encoder1int

Dimension of the first encoder layer.

dim_encoder2int

Dimension of the second encoder layer.

num_topicsint

Number of latent topics for scDoRI.

batch_size_cell_predictionint

Batch size when making predictions in eval mode (e.g., forward passes only).

epoch_warmup_1int

Number of epochs to run “warmup_1” (ATAC+TF) before adding RNA.

max_scdori_epochsint

Maximum number of epochs for the scDoRI phase 1 training (modules 1, 2, and 3).

max_grn_epochsint

Maximum number of epochs for the GRN training phase (module 4).

update_encoder_in_grnbool

Whether to unfreeze the encoder during the GRN phase.

update_peak_gene_in_grnbool

Whether to unfreeze the peak-gene links in the GRN phase.

update_topic_peak_in_grnbool

Whether to unfreeze the topic-peak links in the GRN phase.

update_topic_tf_in_grnbool

Whether to unfreeze the topic-TF links in the GRN phase.

eval_frequencyint

How often (in epochs) to evaluate validation loss.

phase1_patienceint

Early stopping patience (in epochs) for phase 1 training (modules 1, 2, and 3).

grn_val_patienceint

Early stopping patience (in epochs) for the GRN phase.

learning_rate_scdorifloat

Learning rate for scDoRI phase 1 training (modules 1, 2, and 3).

learning_rate_grnfloat

Learning rate for the GRN training phase.

weight_atac_phase1float

Loss weight for ATAC reconstruction in Phase 1:warmup_1.

weight_tf_phase1float

Loss weight for TF reconstruction in Phase 1:warmup_1.

weight_rna_phase1float

Loss weight for RNA reconstruction in Phase 1:warmup_1. Set to 0.

weight_rna_grn_phase1float

Loss weight for GRN-based RNA reconstruction in Phase 1:warmup_1. Set to 0.

weight_atac_phase2float

Loss weight for ATAC reconstruction in Phase 1:warmup_2.

weight_tf_phase2float

Loss weight for TF reconstruction in Phase 1:warmup_2.

weight_rna_phase2float

Loss weight for RNA reconstruction in Phase 1:warmup_2.

weight_rna_grn_phase2float

Loss weight for GRN-based RNA reconstruction in Phase 1:warmup_2. Set to 0.

weight_atac_grnfloat

Loss weight for ATAC reconstruction in the GRN phase.

weight_tf_grnfloat

Loss weight for TF reconstruction in the GRN phase.

weight_rna_grnfloat

Loss weight for RNA reconstruction in the GRN phase.

weight_rna_from_grnfloat

Loss weight for the GRN-based RNA branch in the GRN phase.

l1_penalty_topic_tffloat

L1 regularization coefficient on the topic_tf_decoder.

l2_penalty_topic_tffloat

L2 regularization coefficient on the topic_tf_decoder.

l1_penalty_topic_peakfloat

L1 regularization coefficient on the topic_peak_decoder.

l2_penalty_topic_peakfloat

L2 regularization coefficient on the topic_peak_decoder.

l1_penalty_gene_peakfloat

L1 regularization coefficient on the gene_peak_factor_learnt.

l2_penalty_gene_peakfloat

L2 regularization coefficient on the gene_peak_factor_learnt.

l1_penalty_grn_activatorfloat

L1 regularization on GRN activator parameters (tf_gene_topic_activator_grn).

l1_penalty_grn_repressorfloat

L1 regularization on GRN repressor parameters (tf_gene_topic_repressor_grn).

tf_expression_modestr

Either “True” (use actual TF expression) or “latent” (model’s predicted TF expression).

tf_expression_clampfloat

Clamping threshold for TF expression values in [0, 1].

cells_per_topicint

Number of cells sampled per topic to compute topic-level TF expression.

weights_folder_scdoristr

Folder to save model weights after the scDoRI Phase 1.

weights_folder_grnstr

Folder to save model weights after the GRN phase.

best_scdori_model_pathstr

Filename for saving the best scDoRI model (Phase 1).

best_grn_model_pathstr

Filename for saving the best GRN model.

umap_n_neighborsint

Number of neighbors for UMAP.

umap_min_distfloat

Min dist parameter for UMAP.

umap_random_stateint

Random seed for UMAP.

significance_cutoffslist of float

List of thresholds for empirical p-value cutoffs in TF-gene link permutation tests.

num_permutationsint

Number of permutations used to compute TF-gene link significance.

scdori.models.initialize_scdori_parameters(model, gene_peak_distance_exp: Tensor, gene_peak_fixed: Tensor, insilico_act: Tensor, insilico_rep: Tensor, phase='warmup')[source]#

Initialize or freeze certain scDoRI parameters, preparing for either warmup or GRN phases.

Parameters#

modeltorch.nn.Module

An instance of the scDoRI model.

gene_peak_distance_exptorch.Tensor

Shape (num_genes, num_peaks). Peak-gene distance matrix, usually an exponential decay.

gene_peak_fixedtorch.Tensor

Shape (num_genes, num_peaks). A binary mask indicating allowable gene-peak links.

insilico_acttorch.Tensor

Shape (num_peaks, num_tfs). In silico ChIP-seq matrix for activators.

insilico_reptorch.Tensor

Shape (num_peaks, num_tfs). In silico ChIP-seq matrix for repressors.

phasestr, optional

“warmup” or “grn”. In “warmup”, sets gene-peak and TF-binding matrices, and keeps them fixed or partially trainable. In “grn”, enables TF-gene parameters to be trainable.

Returns#

None

Modifies model in place, setting appropriate .data values and .requires_grad booleans.

class scdori.models.scDoRI(device, num_genes, num_peaks, num_tfs, num_topics, num_batches, dim_encoder1, dim_encoder2, batch_norm=True)[source]#

Bases: Module

The scDoRI model integrates single cell multi-ome RNA and ATAC data to learn latent topic representations and perform gene regulatory network (GRN) inference.

This model contains:

  • Encoders for RNA and ATAC, producing a shared topic distribution.

  • Decoders for ATAC, TF, and RNA reconstruction.

  • GRN logic for combining TF binding data with gene-peak links and TF expression to reconstruct RNA profiles.

Parameters#

devicetorch.device

The device (CPU or CUDA) for PyTorch operations.

num_genesint

Number of genes in the RNA data.

num_peaksint

Number of peaks in the ATAC data.

num_tfsint

Number of transcription factors being modeled.

num_topicsint

Number of latent topics or factors.

num_batchesint

Number of distinct batches (for batch correction).

dim_encoder1int

Dimension of the first encoder layer.

dim_encoder2int

Dimension of the second encoder layer.

batch_normbool, optional

If True, use batch normalization in encoder and library factor MLPs. Default is True.

Attributes#

encoder_rnatorch.nn.Sequential

The neural network layers for the RNA encoder.

encoder_atactorch.nn.Sequential

The neural network layers for the ATAC encoder.

mu_thetatorch.nn.Linear

Linear layer converting combined RNA+ATAC encoder outputs into raw topic logits.

topic_peak_decodertorch.nn.Parameter

A (num_topics x num_peaks) parameter for ATAC reconstruction.

atac_batch_factortorch.nn.Parameter

A (num_batches x num_peaks) parameter for batch effects in ATAC.

atac_batch_normtorch.nn.BatchNorm1d

Batch normalization layer for ATAC predictions.

topic_tf_decodertorch.nn.Parameter

A (num_topics x num_tfs) parameter for TF expression reconstruction.

tf_batch_factortorch.nn.Parameter

A (num_batches x num_tfs) parameter for batch effects in TF reconstruction.

tf_batch_normtorch.nn.BatchNorm1d

Batch normalization layer for TF predictions.

tf_alpha_nbtorch.nn.Parameter

A (1 x num_tfs) parameter for TF negative binomial overdispersion.

gene_peak_factor_learnttorch.nn.Parameter

A (num_genes x num_peaks) learned matrix linking peaks to genes.

gene_peak_factor_fixedtorch.nn.Parameter

A (num_genes x num_peaks) fixed mask for feasible gene-peak links.

rna_batch_factortorch.nn.Parameter

A (num_batches x num_genes) parameter for batch effects in RNA reconstruction.

rna_batch_normtorch.nn.BatchNorm1d

Batch normalization layer for RNA predictions.

rna_alpha_nbtorch.nn.Parameter

A (1 x num_genes) parameter for RNA negative binomial overdispersion.

tf_library_factortorch.nn.Sequential

An MLP to predict library scaling factor for TF data from the observed TF expression.

rna_library_factortorch.nn.Sequential

An MLP to predict library scaling factor for RNA data from the observed gene counts.

tf_binding_matrix_activatortorch.nn.Parameter

A (num_peaks x num_tfs) matrix of in silico ChIP-seq (activator) TF-peak binding; precomputed and fixed.

tf_binding_matrix_repressortorch.nn.Parameter

A (num_peaks x num_tfs) matrix of in silico ChIP-seq (repressor) TF-peak binding; precomputed and fixed.

tf_gene_topic_activator_grntorch.nn.Parameter

A (num_topics x num_tfs x num_genes) matrix capturing per-topic activator regulation.

tf_gene_topic_repressor_grntorch.nn.Parameter

A (num_topics x num_tfs x num_genes) matrix capturing per-topic repressor regulation.

rna_grn_batch_factortorch.nn.Parameter

A (num_batches x num_genes) batch-effect parameter for the GRN-based RNA reconstruction (module 4).

rna_grn_batch_normtorch.nn.BatchNorm1d

Batch normalization layer for GRN-based RNA predictions.

encode(rna_input, atac_input, log_lib_rna, log_lib_atac, num_cells)[source]#

Encode RNA and ATAC input into a topic distribution (theta).

Parameters#

rna_inputtorch.Tensor

A (B, num_genes) tensor of RNA counts per cell.

atac_inputtorch.Tensor

A (B, num_peaks) tensor of ATAC counts per cell.

log_lib_rnatorch.Tensor

A (B, 1) tensor of log RNA library sizes.

log_lib_atactorch.Tensor

A (B, 1) tensor of log ATAC library sizes.

num_cellstorch.Tensor

A (B, 1) tensor representing how many cells are aggregated (if metacells), or all ones for single-cell data.

Returns#

(theta, mu_theta)tuple of torch.Tensor

theta : (B, num_topics), softmaxed topic distribution. mu_theta : (B, num_topics), raw topic logits.
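The relationship between the two returned tensors can be illustrated with a row-wise softmax. The NumPy sketch below uses made-up logits and a hypothetical `softmax` helper; it only shows how raw topic logits map to a distribution that sums to one per cell, not how the encoder computes the logits.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two cells (B = 2) and three topics; the values are arbitrary.
mu_theta = np.array([[2.0, 0.0, -1.0],
                     [0.0, 0.0, 0.0]])
theta = softmax(mu_theta)   # each row is a topic distribution
```

Equal logits, as in the second row, give a uniform distribution over topics.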

forward(rna_input, atac_input, tf_input, topic_tf_input, log_lib_rna, log_lib_atac, num_cells, batch_onehot, phase='warmup_1')[source]#

Forward pass through scDoRI, producing predictions for ATAC, TF, and RNA reconstructions (Phase 1), as well as GRN-based RNA predictions in GRN phase (Phase 2).

Parameters#

rna_inputtorch.Tensor

Shape (B, num_genes). RNA counts per cell in the batch.

atac_inputtorch.Tensor

Shape (B, num_peaks). ATAC counts per cell in the batch.

tf_inputtorch.Tensor

Shape (B, num_tfs). Observed TF expression.

topic_tf_inputtorch.Tensor

Shape (num_topics, num_tfs). TF expression aggregated by topic, used only if phase == “grn”.

log_lib_rnatorch.Tensor

Shape (B, 1). Log of RNA library sizes.

log_lib_atactorch.Tensor

Shape (B, 1). Log of ATAC library sizes.

num_cellstorch.Tensor

Shape (B, 1). Number of cells aggregated (if metacells), else ones.

batch_onehottorch.Tensor

Shape (B, num_batches). One-hot batch encoding for each cell.

phasestr, optional

Which training phase: “warmup_1”, “warmup_2”, or “grn”. If phase == “grn”, the GRN-based RNA predictions are included.

Returns#

dict

A dictionary with the following keys:

  • “theta”: (B, num_topics), the softmaxed topic distribution.

  • “mu_theta”: (B, num_topics), raw topic logits.

  • “preds_atac”: (B, num_peaks), predicted peak accessibility.

  • “preds_tf”: (B, num_tfs), predicted TF expression.

  • “mu_nb_tf”: (B, num_tfs), TF negative binomial mean = preds_tf * TF library factor.

  • “preds_rna”: (B, num_genes), predicted RNA expression.

  • “mu_nb_rna”: (B, num_genes), RNA negative binomial mean = preds_rna * RNA library factor.

  • “preds_rna_from_grn”: (B, num_genes), optional GRN-based RNA predictions.

  • “mu_nb_rna_grn”: (B, num_genes), negative binomial mean of GRN-based RNA predictions.

  • “library_factor_tf”: (B, 1), predicted library factor for TF.

  • “library_factor_rna”: (B, 1), predicted library factor for RNA.

scdori.main.run_scdori_pipeline()[source]#

Run the scDoRI pipeline in three main phases: 1) ATAC+TF warmup (phase 1 warmup), 2) Add RNA (phase 1 full), 3) GRN training (phase 2).

Steps#

  1. Configure logging, set random seed, determine computing device.

  2. Load data: RNA/ATAC AnnData, gene-peak distances, in silico ChIP-seq embeddings.

  3. Split cells into train and eval sets, create DataLoaders.

  4. Build and initialize the scDoRI model: configure it with the number of genes, peaks, TFs, and topics, then initialize its parameters (gene-peak, in silico matrices, etc.).

  5. Train phases 1 & 2 (integrated ATAC + TF, then add RNA).

  6. Save model weights.

  7. Re-initialize GRN-related parameters and run phase 3 (GRN training).

  8. Save final model weights for the GRN phase.

Returns#

None

The pipeline executes end-to-end training of the scDoRI model, saving intermediate and final weights to disk as specified in config.

Notes#

  • This function relies on configuration settings in config.py.

  • The pipeline uses train_scdori_phases for phases 1 & 2, and train_model_grn for the GRN phase.

  • Outputs (model weights) are saved to the paths specified by config.weights_folder_scdori and config.weights_folder_grn.

scdori.utils.log_nb_positive(x, mu, theta, eps: float = 1e-08, log_fn: callable = torch.log, lgamma_fn: callable = torch.lgamma)[source]#

Compute the log-likelihood for a Negative Binomial (NB) distribution.

This function is often used for modeling overdispersed count data in scRNA-seq.

Parameters#

xtorch.Tensor

Observed count data, shape (batch_size, num_features).

mutorch.Tensor

Mean of the negative binomial, must be > 0. Same shape as x.

thetatorch.Tensor

Inverse-dispersion (overdispersion) parameter, must be > 0. Same shape as x.

epsfloat, optional

A small constant for numerical stability in logarithms. Default is 1e-8.

log_fncallable, optional

A function to take the logarithm, typically torch.log. Default is torch.log.

lgamma_fncallable, optional

A function for computing log-gamma, typically torch.lgamma. Default is torch.lgamma.

Returns#

torch.Tensor

Element-wise log-likelihood of shape (batch_size, num_features).
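For intuition, the negative binomial log-likelihood in the mean and inverse-dispersion parameterization can be written for a single count using only the standard library. This scalar sketch assumes the common form of the formula and places `eps` inside the logarithms; `nb_log_likelihood` is a hypothetical name, and the library function operates element-wise on tensors instead.

```python
import math

def nb_log_likelihood(x, mu, theta, eps=1e-8):
    """Log-probability of count x under an NB with mean mu and inverse dispersion theta."""
    log_theta_mu = math.log(theta + mu + eps)
    return (
        theta * (math.log(theta + eps) - log_theta_mu)   # theta * log(theta / (theta + mu))
        + x * (math.log(mu + eps) - log_theta_mu)        # x * log(mu / (theta + mu))
        + math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
    )

# With theta = 1 the NB is a geometric distribution, so P(x = 0) = theta / (theta + mu).
ll_zero = nb_log_likelihood(0, mu=1.0, theta=1.0)

# Probabilities summed over a long enough range of counts should be close to 1.
total_prob = sum(math.exp(nb_log_likelihood(k, mu=3.0, theta=2.0)) for k in range(200))
```

A larger `theta` means less overdispersion: as `theta` grows, the distribution approaches a Poisson with mean `mu`.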

scdori.utils.set_seed(seed=200)[source]#

Set the random seed for Python, NumPy, and PyTorch (including CUDA if available).

Parameters#

seedint, optional

The desired random seed. Default is 200.

Returns#

None

Modifies global states of Python, NumPy, and PyTorch seeds in place.

Notes#

Useful for ensuring reproducible results across runs when training or testing the model. However, full reproducibility can still be subject to GPU hardware determinism settings.
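A seeding helper of this kind usually looks like the sketch below. `seed_everything` is a hypothetical name, and the PyTorch calls are wrapped in a try/except only so that the snippet also runs where PyTorch is not installed; the library's own function may set additional flags.

```python
import random
import numpy as np

def seed_everything(seed=200):
    """Seed Python, NumPy and, if it is installed, PyTorch (CPU and CUDA)."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not available; Python and NumPy are still seeded

# Re-seeding with the same value reproduces the same random draws.
seed_everything(200)
first = (random.random(), np.random.rand())
seed_everything(200)
second = (random.random(), np.random.rand())
```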

scdori.train_scdori.compute_eval_loss_scdori(model, device, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute the validation loss for scDoRI.

Parameters#

modeltorch.nn.Module

The scDoRI model to evaluate.

devicetorch.device

The device (CPU or CUDA) used for PyTorch operations.

eval_loadertorch.utils.data.DataLoader

A DataLoader providing validation cell indices.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells per row (if metacells) or ones for single-cell data.

tf_indiceslist or np.ndarray

Indices of transcription factor genes in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding for batch information (cell x num_batches).

config_fileobject

Configuration object with hyperparameters (loss weights, penalties, etc.).

Returns#

tuple

(eval_loss, eval_loss_atac, eval_loss_tf, eval_loss_rna), each a float.

scdori.train_scdori.get_loss_weights_scdori(phase, config_file)[source]#

Get the loss weight dictionary for the specified phase.

Parameters#

phasestr

The phase of training, one of {“warmup_1”, “warmup_2”}.

config_fileobject

Configuration object containing attributes like weight_atac_phase1, weight_tf_phase1, weight_rna_phase1, etc.

Returns#

dict

A dictionary with keys {“atac”, “tf”, “rna”} indicating the respective loss weights.

scdori.train_scdori.get_phase_scdori(epoch, config_file)[source]#

Determine which training phase to use at a given epoch. In warmup_1, only modules 1 and 3 (ATAC and TF reconstruction) are trained; RNA reconstruction from ATAC is then added in warmup_2.

Parameters#

epochint

The current training epoch.

config_fileobject

Configuration object that includes epoch_warmup_1 to define the cutoff for switching from phase “warmup_1” to “warmup_2”.

Returns#

str

The phase: “warmup_1” if epoch < config_file.epoch_warmup_1, else “warmup_2”.
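Phase selection and the matching loss-weight lookup documented above can be sketched together. The helper names `get_phase` and `get_loss_weights` and all numeric values are placeholders for illustration; only the attribute names and the `epoch < epoch_warmup_1` rule come from this reference.

```python
from types import SimpleNamespace

def get_phase(epoch, cfg):
    """'warmup_1' before cfg.epoch_warmup_1, 'warmup_2' from then on."""
    return "warmup_1" if epoch < cfg.epoch_warmup_1 else "warmup_2"

def get_loss_weights(phase, cfg):
    """Pick the weight_*_phase1 or weight_*_phase2 attributes for the given phase."""
    suffix = "phase1" if phase == "warmup_1" else "phase2"
    return {name: getattr(cfg, f"weight_{name}_{suffix}") for name in ("atac", "tf", "rna")}

# Placeholder configuration; the real values live in the config file.
cfg = SimpleNamespace(
    epoch_warmup_1=2,
    weight_atac_phase1=1.0, weight_tf_phase1=1.0, weight_rna_phase1=0.0,
    weight_atac_phase2=1.0, weight_tf_phase2=1.0, weight_rna_phase2=1.0,
)

# (epoch, phase, RNA loss weight) for the first four epochs.
schedule = [
    (epoch, get_phase(epoch, cfg), get_loss_weights(get_phase(epoch, cfg), cfg)["rna"])
    for epoch in range(4)
]
```

With these placeholder values the RNA loss is switched off for epochs 0 and 1 and switched on from epoch 2.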

scdori.train_scdori.train_scdori_phases(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Train the scDoRI model in two warmup phases: 1) Warmup Phase 1 (ATAC + TF focus). 2) Warmup Phase 2 (adding RNA).

Includes early stopping based on validation performance.

Parameters#

modeltorch.nn.Module

The scDoRI model to be trained.

devicetorch.device

The device (CPU or CUDA) for running PyTorch operations.

train_loadertorch.utils.data.DataLoader

DataLoader for the training set, providing cell indices.

eval_loadertorch.utils.data.DataLoader

DataLoader for the validation set, providing cell indices.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells per row (metacells) or ones for single-cell data.

tf_indiceslist or np.ndarray

Indices of transcription factor genes in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding matrix for batch information (cells x num_batches).

config_fileobject

Configuration with hyperparameters including:

  • learning_rate_scdori

  • max_scdori_epochs

  • epoch_warmup_1

  • weight_atac_phase1, weight_tf_phase1, weight_rna_phase1

  • weight_atac_phase2, weight_tf_phase2, weight_rna_phase2

  • l1_penalty_topic_tf, etc.

  • eval_frequency

  • phase1_patience (early stopping patience for validation loss)

Returns#

torch.nn.Module

The trained scDoRI model after both warmup phases.

scdori.evaluation.get_latent_topics(model, device, data_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot)[source]#

Extract the softmaxed topic activations (theta) for each cell in the dataset.

Parameters#

modeltorch.nn.Module

The scDoRI model containing an encoder for generating topic distributions.

devicetorch.device

The PyTorch device (e.g., ‘cpu’ or ‘cuda’) used for computations.

data_loadertorch.utils.data.DataLoader

A DataLoader that yields batches of cell indices.

rna_anndataanndata.AnnData

RNA single-cell data in AnnData format.

atac_anndataanndata.AnnData

ATAC single-cell data in AnnData format.

num_cellsnp.ndarray

Number of cells in each row (e.g., if using metacells). Set to ones for single-cell data.

tf_indicesnp.ndarray

Indices of transcription factor genes in the RNA data.

encoding_batch_onehotnp.ndarray

One-hot encoding of batch information (cells x num_batches).

Returns#

np.ndarray

A 2D NumPy array of shape (n_cells, n_topics) representing the softmaxed topic activations for each cell in the order given by the DataLoader.

scdori.downstream.compute_activator_tf_activity_per_cell(grn_final, tf_names, latent_all_torch, selected_topics=None, clamp_value=1e-08, zscore=True)[source]#

Compute per-cell activity of activator TFs.

Parameters#

grn_finalnp.ndarray or torch.Tensor

Activator GRN of shape (num_topics, num_tfs, num_genes).

tf_nameslist of str

List of TF names, length = num_tfs.

latent_all_torchnp.ndarray or torch.Tensor

scDoRI latent topic activity of shape (num_cells, num_topics).

selected_topicslist of int, optional

Which topics to analyze. If None, all topics are used.

clamp_valuefloat, optional

Small constant to avoid division by zero. Default is 1e-8.

zscorebool, optional

If True, apply z-score normalization across cells in the final matrix. Default is True.

Returns#

np.ndarray

A (num_cells, num_tfs) array of TF activity values.
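The input and output shapes can be made concrete with a rough NumPy sketch. The aggregation used here (summing GRN weights over genes, then weighting by each cell's topic activity) is an assumption made purely for illustration and may differ from what the library computes; `tf_activity_per_cell` is a hypothetical name. Only the optional z-scoring across cells follows the parameter description directly.

```python
import numpy as np

def tf_activity_per_cell(grn, latent, zscore=True, clamp_value=1e-8):
    """grn: (num_topics, num_tfs, num_genes); latent: (num_cells, num_topics)."""
    topic_tf = grn.sum(axis=2)        # assumed: total target weight per TF and topic
    activity = latent @ topic_tf      # (num_cells, num_tfs)
    if zscore:
        mean = activity.mean(axis=0, keepdims=True)
        std = activity.std(axis=0, keepdims=True)
        activity = (activity - mean) / np.maximum(std, clamp_value)  # guard against std = 0
    return activity

rng = np.random.default_rng(0)
grn = rng.random((3, 4, 5))                   # 3 topics, 4 TFs, 5 genes
latent = rng.dirichlet(np.ones(3), size=6)    # 6 cells, topic weights sum to 1 per cell
activity = tf_activity_per_cell(grn, latent)
```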

scdori.downstream.compute_atac_grn_activator_with_significance(model, device, cutoff_val, outdir)[source]#

Compute significant ATAC-derived TF–gene links for activators with permutation-based significance.

Uses only the learned peak-gene links and in silico ChIP-seq activator matrices. Significance is computed by permuting TF-binding profiles on peaks.

Parameters#

modeltorch.nn.Module

The trained model containing peak and TF decoders.

devicetorch.device

The device (CPU or CUDA) for PyTorch operations.

cutoff_valfloat

Significance cutoff (e.g., 0.95) for the percentile filtering.

outdirstr

Directory to save the computed GRN results.

Returns#

np.ndarray

A (num_topics, num_tfs, num_genes) array of significant ATAC-derived activator GRNs.
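The idea of permutation-based significance can be illustrated for a single TF. This sketch is not the library's procedure: the score (peak-gene weights multiplied by the TF's binding profile), the add-one p-value estimate, and the name `empirical_pvalues` are all assumptions chosen to show how shuffling binding across peaks yields a null distribution.

```python
import numpy as np

def empirical_pvalues(gene_peak, tf_binding, num_permutations=200, seed=0):
    """gene_peak: (num_genes, num_peaks); tf_binding: (num_peaks,) for one TF."""
    rng = np.random.default_rng(seed)
    observed = gene_peak @ tf_binding                     # (num_genes,) observed TF-gene score
    exceed = np.zeros_like(observed)
    for _ in range(num_permutations):
        null = gene_peak @ rng.permutation(tf_binding)    # shuffle binding across peaks
        exceed += null >= observed
    return (exceed + 1) / (num_permutations + 1)          # add-one so that p is never 0

# Toy data: the TF binds peaks 0-4; gene 0 is linked to exactly those peaks,
# gene 1 is linked to peaks 10-14, where the TF does not bind.
tf_binding = np.zeros(50)
tf_binding[:5] = 1.0
gene_peak = np.zeros((2, 50))
gene_peak[0, :5] = 1.0
gene_peak[1, 10:15] = 1.0
pvals = empirical_pvalues(gene_peak, tf_binding)
```

Gene 0 receives a small empirical p-value because random shuffles almost never place all bound peaks on its links, while gene 1 has no support and gets a p-value of 1.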

scdori.downstream.compute_atac_grn_repressor_with_significance(model, device, cutoff_val, outdir)[source]#

Compute significant ATAC-derived TF–gene links for repressors using permutation-based significance.

Uses the learned peak-gene links and in silico ChIP-seq repressor matrices. Significance is computed by permuting TF-binding profiles on peaks.

Parameters#

modeltorch.nn.Module

The trained model containing peak and TF decoders.

devicetorch.device

The device (CPU or CUDA) for PyTorch operations.

cutoff_valfloat

Significance cutoff (e.g., 0.05) for percentile filtering.

outdirstr

Directory to save the computed GRN results.

Returns#

np.ndarray

A (num_topics, num_tfs, num_genes) array of significant ATAC-derived repressor GRNs.

scdori.downstream.compute_neighbors_umap(rna_anndata, rep_key='X_scdori')[source]#

Compute neighbors and UMAP on the specified representation in an AnnData object.

Parameters#

rna_anndataanndata.AnnData

An AnnData object containing single-cell RNA data.

rep_keystr, optional

The key in rna_anndata.obsm that holds the latent representation used for computing UMAP. Default is “X_scdori”.

Returns#

None

Updates rna_anndata in place with neighbor graph and UMAP coordinates.

scdori.downstream.compute_repressor_tf_activity_per_cell(grn_final, tf_names, latent_all_torch, selected_topics=None, clamp_value=1e-08, zscore=True)[source]#

Compute per-cell activity of repressor TFs.

Parameters#

grn_finalnp.ndarray or torch.Tensor

Repressor GRN of shape (num_topics, num_tfs, num_genes).

tf_nameslist of str

List of TF names, length = num_tfs.

latent_all_torchnp.ndarray or torch.Tensor

scDoRI latent topic activity of shape (num_cells, num_topics).

selected_topicslist of int, optional

Which topics to analyze. If None, all topics are used.

clamp_valuefloat, optional

Small constant to avoid division by zero. Default is 1e-8.

zscorebool, optional

If True, apply z-score normalization across cells in the final matrix. Default is True.

Returns#

np.ndarray

A (num_cells, num_tfs) array of TF activity values.

scdori.downstream.compute_significant_grn(model, device, cutoff_val_activator, cutoff_val_repressor, tf_normalised, outdir)[source]#

Combine significant ATAC-derived and scDoRI-learned GRN links into final activator and repressor GRNs.

Parameters#

modeltorch.nn.Module

The scDoRI model containing learned TF-gene topic parameters.

devicetorch.device

CPU or CUDA device for PyTorch operations.

cutoff_val_activatorfloat

Significance cutoff used for the activator GRN file.

cutoff_val_repressorfloat

Significance cutoff used for the repressor GRN file.

tf_normalisednp.ndarray or torch.Tensor

A (num_topics x num_tfs, 1) or (num_topics x num_tfs) matrix of normalized TF usage.

outdirstr

Directory containing the ATAC-based GRN files and to save computed results.

Returns#

tuple of np.ndarray
grn_actshape (num_topics, num_tfs, num_genes)

Computed activator GRN array.

grn_repshape (num_topics, num_tfs, num_genes)

Computed repressor GRN array.

Raises#

FileNotFoundError

If the required ATAC-derived GRN files are missing.

scdori.downstream.compute_topic_gene_matrix(model, device)[source]#

Compute a topic-gene matrix for downstream analysis (e.g., GSEA).

Steps#

  1. Apply softmax to model.topic_peak_decoder => (num_topics, num_peaks).

  2. Min-max normalize each peak across topics.

  3. Multiply by (gene_peak_factor_fixed * gene_peak_factor_learnt).

  4. Apply batch norm and softmax.

  5. Get the topic-gene matrix of shape (num_topics, num_genes).

Parameters#

modeltorch.nn.Module

The scDoRI model containing topic_peak_decoder and gene_peak_factor.

devicetorch.device

The device (CPU or CUDA) used for PyTorch operations.

Returns#

np.ndarray

A matrix of shape (num_topics, num_genes) representing topic-gene scores.
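The steps listed for `compute_topic_gene_matrix` can be approximated in NumPy as follows. Several details are assumptions: the first softmax is taken over peaks within each topic, the multiplication is a matrix product with the transposed gene-peak matrix, and the batch-norm layer is omitted. `topic_gene_sketch` and the random inputs are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    """Softmax along the given axis, with max-subtraction for stability."""
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topic_gene_sketch(topic_peak_decoder, gene_peak_fixed, gene_peak_learnt):
    topic_peak = softmax(topic_peak_decoder, axis=1)        # step 1 (assumed: over peaks)
    lo = topic_peak.min(axis=0, keepdims=True)
    hi = topic_peak.max(axis=0, keepdims=True)
    topic_peak = (topic_peak - lo) / (hi - lo + 1e-8)       # step 2: each peak across topics
    gene_peak = gene_peak_fixed * gene_peak_learnt          # mask times learned weights
    topic_gene = topic_peak @ gene_peak.T                   # step 3: (T, P) @ (P, G)
    return softmax(topic_gene, axis=1)                      # final softmax, batch norm omitted

rng = np.random.default_rng(0)
topic_peak_decoder = rng.normal(size=(3, 6))                    # 3 topics, 6 peaks
gene_peak_fixed = (rng.random((4, 6)) > 0.5).astype(float)      # binary mask, 4 genes
gene_peak_learnt = rng.random((4, 6))
topic_gene = topic_gene_sketch(topic_peak_decoder, gene_peak_fixed, gene_peak_learnt)
```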

scdori.downstream.compute_topic_peak_umap(model, device)[source]#

Compute a UMAP embedding of the topic-peak decoder matrix. Each point on this embedding is a peak.

Steps#

  1. Apply softmax to model.topic_peak_decoder => (num_topics, num_peaks).

  2. Min-max normalize across topics.

  3. Transpose to get (num_peaks, num_topics).

  4. Run UMAP on the resulting matrix to get a (num_peaks, 2) embedding.

Parameters#

modeltorch.nn.Module

The scDoRI model containing the topic_peak_decoder.

devicetorch.device

The device (CPU or CUDA) used for PyTorch operations.

Returns#

tuple of (np.ndarray, np.ndarray)
embedding_peaksshape (num_peaks, 2)

The UMAP embedding of the peaks.

peak_matshape (num_peaks, num_topics)

The min-max normalized topic-peak matrix.

scdori.downstream.get_top_activators_per_topic(grn_final, tf_names, latent_all_torch, selected_topics=None, top_k=10, clamp_value=1e-08, zscore=True, figsize=(25, 10), out_fig=None)[source]#

Identify and plot top activator transcription factors per topic (Topic regulators, TRs).

Parameters#

grn_finalnp.ndarray or torch.Tensor

An array of shape (num_topics, num_tfs, num_genes), representing an activator GRN.

tf_nameslist of str

List of TF names, length = num_tfs.

latent_all_torchnp.ndarray or torch.Tensor

scDoRI latent topic activity of shape (num_cells, num_topics). Not always used, but can be referenced.

selected_topicslist of int, optional

Which topics to analyze. If None, all topics are used.

top_kint, optional

Number of top TFs to select per topic. Default is 10.

clamp_valuefloat, optional

Small cutoff to avoid division by zero. Default is 1e-8.

zscorebool, optional

If True, apply z-score normalization across topics in the final matrix. Default is True.

figsizetuple, optional

Size for the Seaborn clustermap. Default is (25, 10).

out_figstr or Path, optional

If provided, the figure is saved to this path; otherwise it is shown.

Returns#

tuple
df_topic_grnpd.DataFrame

The final DataFrame of shape (#topics, #TF).

selected_tflist of str

A sorted list of TFs used in the final clustermap.
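The documentation does not spell out how TFs are scored, so the following is only one plausible reading. It assumes a TF's score in a topic is the sum of its GRN weights over all target genes, that z-scoring is applied per TF across topics, and that `clamp_value` guards the division. The function `top_tfs_per_topic` is hypothetical and omits the clustermap.

```python
import numpy as np

def top_tfs_per_topic(grn, tf_names, top_k=10, clamp_value=1e-8, zscore=True):
    """Hypothetical scoring: total GRN weight per (topic, TF), optionally z-scored."""
    grn = np.asarray(grn, dtype=float)
    scores = grn.sum(axis=2)                          # (num_topics, num_tfs)
    if zscore:
        mu = scores.mean(axis=0, keepdims=True)
        sd = scores.std(axis=0, keepdims=True)
        # Clamping the denominator avoids division by zero for constant TFs.
        scores = (scores - mu) / np.maximum(sd, clamp_value)
    # Indices of the top_k highest-scoring TFs in each topic.
    top_idx = np.argsort(-scores, axis=1)[:, :top_k]
    selected = sorted({tf_names[j] for j in np.unique(top_idx)})
    return scores, selected

grn = np.zeros((2, 3, 2))      # (num_topics, num_tfs, num_genes)
grn[0, 0, :] = 1.0             # TF "A" is active in topic 0
grn[1, 2, :] = 1.0             # TF "C" is active in topic 1
scores, selected = top_tfs_per_topic(grn, ["A", "B", "C"], top_k=1)
print(selected)  # ['A', 'C']
```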

scdori.downstream.get_top_repressor_per_topic(grn_final, tf_names, latent_all_torch, selected_topics=None, top_k=5, clamp_value=1e-08, zscore=True, figsize=(25, 10), out_fig=None)[source]#

Identify and plot top repressor transcription factors per topic.

Parameters#

grn_finalnp.ndarray or torch.Tensor

An array of shape (num_topics, num_tfs, num_genes), representing a repressor GRN.

tf_nameslist of str

List of TF names, length = num_tfs.

latent_all_torchnp.ndarray or torch.Tensor

scDoRI latent topic activity of shape (num_cells, num_topics).

selected_topicslist of int, optional

Which topics to analyze. If None, all topics are used.

top_kint, optional

Number of top TFs to select per topic. Default is 5.

clamp_valuefloat, optional

Small cutoff to avoid division by zero. Default is 1e-8.

zscorebool, optional

If True, apply z-score normalization across topics in the final matrix. Default is True.

figsizetuple, optional

Size for the Seaborn clustermap. Default is (25, 10).

out_figstr or Path, optional

If provided, the figure is saved to this path; otherwise it is shown.

Returns#

tuple
df_plotpd.DataFrame

The final DataFrame of shape (#topics, #TF).

selected_tflist of str

A sorted list of TFs used in the final clustermap.

scdori.downstream.load_best_model(model, best_model_path, device)[source]#

Load the best model weights from disk into the given model.

Parameters#

modeltorch.nn.Module

The model instance into which the weights will be loaded.

best_model_pathstr or Path

Path to the file containing the best model weights.

devicetorch.device

The device (CPU or CUDA) to which the model will be moved.

Returns#

torch.nn.Module

The same model, now loaded with weights and set to eval mode.

Raises#

FileNotFoundError

If the specified best_model_path does not exist.

scdori.downstream.plot_topic_activation_heatmap(rna_anndata, groupby_key='celltype', aggregation='median')[source]#

Compute aggregated scDoRI latent topic activation across groups, then plot a clustermap.

Parameters#

rna_anndataanndata.AnnData

An AnnData object containing scDoRI latent factors in obsm[“X_scdori”].

groupby_keystr, optional

Column in rna_anndata.obs by which to group cells. Default is “celltype”.

aggregationstr, optional

Either “median” or “mean” for aggregating factor values per group. Default is “median”.

Returns#

pd.DataFrame

The transposed aggregated DataFrame (topics x groups).

Notes#

Uses a Seaborn clustermap to visualize the aggregated data.
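The aggregation step can be illustrated without the plotting code. This sketch assumes the latent matrix is a dense array of shape (num_cells, num_topics) and that group labels come from a column of `obs`; the helper `aggregate_topics` is hypothetical.

```python
import numpy as np

def aggregate_topics(latent, groups, aggregation="median"):
    """Return (group_names, matrix) with matrix of shape (num_topics, num_groups)."""
    if aggregation not in ("median", "mean"):
        raise ValueError("aggregation must be 'median' or 'mean'")
    reduce = np.median if aggregation == "median" else np.mean
    groups = np.asarray(groups)
    names = sorted(set(groups.tolist()))
    # One aggregated topic vector per group.
    per_group = [reduce(latent[groups == g], axis=0) for g in names]
    return names, np.stack(per_group, axis=0).T      # transpose: topics x groups

latent = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 9.0]])
labels = ["B", "B", "T", "T", "T"]
names, topic_by_group = aggregate_topics(latent, labels)
print(names)                    # ['B', 'T']
print(topic_by_group.tolist())  # [[2.0, 0.0], [0.0, 4.0]]
```

The median is less sensitive than the mean to a few cells with extreme topic activity, such as the value 9.0 in group "T" above.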

scdori.downstream.save_regulons(grn_matrix, tf_names, gene_names, num_topics, output_dir, mode='activator')[source]#

Save regulons (TF-gene links across topics) for each TF based on a given GRN matrix.

Parameters#

grn_matrixnp.ndarray

A GRN matrix of shape (num_topics, num_tfs, num_genes).

tf_nameslist of str

List of transcription factor names, length = num_tfs.

gene_nameslist of str

List of gene names, length = num_genes.

num_topicsint

Number of topics in the GRN matrix.

output_dirstr

Directory where the regulon files will be saved.

modestr, optional

“activator” or “repressor”, used to name the output subdirectory/files.

Returns#

None

Saves one TSV file per TF in output_dir. Each file holds a matrix of shape (num_topics, num_genes) in which non-zero values represent a TF-gene link.
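As a rough sketch of the per-TF export, the snippet below slices the GRN tensor by TF and writes one tab-separated file each. The subdirectory name, file name pattern, and header row are assumptions made for illustration; `save_regulons_sketch` is not the library function.

```python
import csv
import tempfile
from pathlib import Path

import numpy as np

def save_regulons_sketch(grn_matrix, tf_names, gene_names, output_dir, mode="activator"):
    """Write one (num_topics, num_genes) TSV per TF; the file layout is an assumption."""
    out = Path(output_dir) / mode                 # hypothetical subdirectory naming
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for tf_idx, tf in enumerate(tf_names):
        regulon = grn_matrix[:, tf_idx, :]        # (num_topics, num_genes) slice for this TF
        path = out / f"{tf}_{mode}.tsv"           # hypothetical file name
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["topic", *gene_names])
            for topic_idx, row in enumerate(regulon):
                writer.writerow([topic_idx, *row.tolist()])
        paths.append(path)
    return paths

grn = np.zeros((2, 2, 3))                         # (num_topics, num_tfs, num_genes)
grn[1, 0, 2] = 0.5                                # topic 1: TF "GATA1" -> gene "g3"
with tempfile.TemporaryDirectory() as tmp:
    written = save_regulons_sketch(grn, ["GATA1", "SPI1"], ["g1", "g2", "g3"], tmp)
    first_file_lines = written[0].read_text().splitlines()
```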

scdori.downstream.visualize_downstream_targets(rna_anndata, gene_list, score_name='target_score', layer='log')[source]#

Visualize the average expression of given genes on a UMAP embedding.

Uses scanpy.tl.score_genes to compute a gene score, then plots using scanpy.pl.umap.

Parameters#

rna_anndataanndata.AnnData

The AnnData object containing RNA data with .obsm[“X_umap”].

gene_listlist of str

A list of gene names to score.

score_namestr, optional

Name of the resulting gene score in rna_anndata.obs. Default is “target_score”.

layerstr, optional

Which layer to use if needed in score_genes. Default is “log”.

Returns#

None

Plots the UMAP colored by the computed gene score.

scdori.dataloader.create_minibatch(device, index_matrix, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot)[source]#

Create a minibatch of required input tensors using integer indices of cells.

Parameters#

devicetorch.device

The device (CPU or CUDA) to which the data should be moved.

index_matrixtorch.Tensor

A 1D tensor containing integer indices of the cells in the minibatch.

rna_anndataanndata.AnnData

AnnData object for RNA data. The .X matrix should contain RNA counts or expression values.

atac_anndataanndata.AnnData

AnnData object for ATAC data. The .X matrix should contain accessibility counts.

num_cellsnp.ndarray

A NumPy array (N x 1) indicating the number of cells represented by each row (if using metacells). For single-cell level data, this may be an array of ones.

tf_indicesnp.ndarray

Indices corresponding to transcription factors (TFs) in the RNA AnnData.

encoding_batch_onehotnp.ndarray

A one-hot encoded matrix representing batch information for each cell (cells x num_batches).

Returns#

tuple

A tuple containing:

  • input_matrix (torch.Tensor): Concatenated RNA and ATAC input of shape (B, g + p), where B is batch size, g is the number of genes, and p is the number of peaks. Values are floats on the given device.

  • tf_exp (torch.Tensor): RNA expression values for TFs, shape (B, num_tfs).

  • library_size_value (torch.Tensor): Log-scale library sizes for RNA and ATAC, shape (B, 2).

  • num_cells_value (torch.Tensor): Number of cells per row in the minibatch (B, 1).

  • input_batch (torch.Tensor): One-hot batch-encoding, shape (B, num_batches).

Notes#

  • This function converts sparse arrays to dense if necessary.

  • ATAC counts are converted from insertion counts to fragment counts by using (x + 1) // 2.
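The count conversion and the concatenated input can be illustrated with toy arrays. NumPy is used here instead of torch for brevity. The library-size line assumes a plain natural log of per-row totals computed on fragment counts, which the documentation does not confirm.

```python
import numpy as np

def insertions_to_fragments(atac_counts):
    """Convert insertion counts to fragment counts with (x + 1) // 2."""
    return (np.asarray(atac_counts) + 1) // 2

rna = np.array([[3, 0, 1], [0, 2, 2]])            # (B=2, g=3) toy RNA counts
atac = np.array([[0, 1, 2, 5], [4, 0, 3, 1]])     # (B=2, p=4) toy insertion counts

fragments = insertions_to_fragments(atac)
print(fragments.tolist())  # [[0, 1, 1, 3], [2, 0, 2, 1]]

# Concatenate RNA and ATAC along the feature axis: shape (B, g + p).
input_matrix = np.concatenate([rna, fragments], axis=1).astype(np.float32)

# Assumption: library size is the log of each modality's total count per row.
library_size = np.log(np.stack([rna.sum(axis=1), fragments.sum(axis=1)], axis=1))
```

The integer division rounds odd insertion counts up, so a single observed insertion still counts as one fragment.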

scdori.data_io.load_scdori_inputs(config_file)[source]#

Load RNA and ATAC data (.h5ad files), plus the gene-peak distance matrix and the in silico ChIP-seq activator and repressor matrices.

Parameters#

config_fileobject

A configuration file containing the attributes:

  • data_dir : pathlib.Path

    The base directory for input data.

  • output_subdir : str

    The subdirectory where output files are located.

  • rna_metacell_file : str

    The filename for the RNA data (single cell or metacell) (H5AD).

  • atac_metacell_file : str

    The filename for the ATAC data (single cell or metacell) (H5AD).

  • gene_peak_distance_file : str

    The filename for the NumPy array with gene-peak distance matrix.

  • insilico_chipseq_act_file : str

    The filename for the in silico ChIP-seq activator matrix.

  • insilico_chipseq_rep_file : str

    The filename for the in silico ChIP-seq repressor matrix.

Returns#

tuple

A tuple containing:

  • rna_metacell : anndata.AnnData

    RNA data loaded from H5AD.

  • atac_metacell : anndata.AnnData

    ATAC data loaded from H5AD.

  • gene_peak_dist : torch.Tensor

    A tensor of shape (num_genes, num_peaks) representing gene-peak distances.

  • insilico_act : torch.Tensor

    A tensor of shape (num_peaks, num_motifs) for in silico ChIP-seq (activator) embeddings.

  • insilico_rep : torch.Tensor

    A tensor of shape (num_peaks, num_motifs) for in silico ChIP-seq (repressor) embeddings.
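For orientation, a configuration object exposing the attributes listed above could look like the sketch below. Every value is a placeholder, and the way the paths are joined is an assumption; scDoRI's own configuration module may organize them differently.

```python
from pathlib import Path
from types import SimpleNamespace

# Every value below is a placeholder chosen for illustration.
config = SimpleNamespace(
    data_dir=Path("/path/to/data"),
    output_subdir="scdori_inputs",
    rna_metacell_file="rna.h5ad",
    atac_metacell_file="atac.h5ad",
    gene_peak_distance_file="gene_peak_distance.npy",
    insilico_chipseq_act_file="insilico_chipseq_act.npy",
    insilico_chipseq_rep_file="insilico_chipseq_rep.npy",
)

# Assumption: inputs are resolved as data_dir / output_subdir / <filename>.
rna_path = config.data_dir / config.output_subdir / config.rna_metacell_file
print(rna_path.name)  # rna.h5ad
```

With scDoRI installed, an object of this kind would be passed as `load_scdori_inputs(config)` and the result unpacked into the five returned values.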

scdori.data_io.save_model_weights(model, path: Path, tag: str)[source]#

Save model weights to a specified path with a given tag.

Parameters#

modeltorch.nn.Module

The PyTorch model whose state_dict is to be saved.

pathpathlib.Path

The directory path where the weights file will be saved.

tagstr

An identifier to include in the saved filename (e.g., “best_eval”).

Returns#

None