scDoRI training#

scdori.train_grn.compute_eval_loss_grn(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute the validation (evaluation) loss for the GRN phase.

This function evaluates loss components for ATAC, TF, RNA, and RNA-from-GRN on a validation dataset.

Parameters#

modeltorch.nn.Module: The scDoRI model.
devicetorch.device: The device (CPU or CUDA) used for PyTorch tensors.
train_loaderDataLoader: DataLoader for the training set (used to compute TF expression).
eval_loaderDataLoader: DataLoader for the validation set.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells constituting each input metacell; set to 1 for single-cell data.
tf_indiceslist of int: Indices of TF features in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding for batch information.
config_filepython file: Configuration file for model training.

Returns#

tuple of float: A tuple containing: (eval_loss, eval_loss_atac, eval_loss_tf, eval_loss_rna, eval_loss_rna_grn).

scdori.train_grn.get_tf_expression(tf_expression_mode, model, device, train_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute TF expression per topic.

If tf_expression_mode is “True”, this function computes the mean TF expression for the top-k cells in each topic. Otherwise, it uses a normalized topic-TF decoder matrix from the model.

Parameters#

tf_expression_modestr: Mode for TF expression. “True” calculates per-topic TF expression from top-k cells, “latent” uses the topic-TF decoder matrix.
modeltorch.nn.Module: The scDoRI model containing encoder and decoder modules.
devicetorch.device: The device (CPU or CUDA) used for PyTorch tensors.
train_loaderDataLoader: DataLoader for training data.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells constituting each input metacell; set to 1 for single-cell data.
tf_indiceslist of int: Indices of TF features in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding for batch information.
config_filepython file: Configuration file for model training.

Returns#

torch.Tensor: A (num_topics x num_tfs) tensor of TF expression values for each topic.
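
The “True” mode described above can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the function name topic_tf_expression and the toy inputs are hypothetical, plain Python lists stand in for tensors, and any normalization or clamping that get_tf_expression applies is left out.

```python
def topic_tf_expression(theta, tf_counts, cells_per_topic):
    """Mean TF expression over the top-k cells of each topic (illustrative only).

    theta:      list of rows, one per cell, each of length num_topics
    tf_counts:  list of rows, one per cell, each of length num_tfs
    Returns a (num_topics x num_tfs) nested list.
    """
    num_topics = len(theta[0])
    num_tfs = len(tf_counts[0])
    result = []
    for k in range(num_topics):
        # Rank cells by their activation of topic k and keep the strongest ones.
        ranked = sorted(range(len(theta)), key=lambda c: theta[c][k], reverse=True)
        top_cells = ranked[:cells_per_topic]
        # Average each TF over the selected cells.
        result.append(
            [sum(tf_counts[c][t] for c in top_cells) / len(top_cells) for t in range(num_tfs)]
        )
    return result


# Toy data (hypothetical): 4 cells, 2 topics, 2 TFs.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
tf_counts = [[4.0, 0.0], [2.0, 0.0], [0.0, 6.0], [0.0, 2.0]]
expr = topic_tf_expression(theta, tf_counts, cells_per_topic=2)
```

Here cells_per_topic plays the role of the configuration attribute of the same name.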

scdori.train_grn.set_encoder_frozen(model, freeze=True)[source]#

Freeze or unfreeze the encoder parameters.

Parameters#

modeltorch.nn.Module: scDoRI model containing the encoder modules.
freezebool, optional: If True, freeze the encoder parameters; if False, unfreeze them. Default is True.

scdori.train_grn.set_peak_gene_frozen(model, freeze=True)[source]#

Freeze or unfreeze the peak-gene link parameters.

Parameters#

modeltorch.nn.Module: scDoRI model containing the peak-gene factor.
freezebool, optional: If True, freeze the peak-gene parameters; if False, unfreeze them. Default is True.

scdori.train_grn.set_topic_peak_frozen(model, freeze=True)[source]#

Freeze or unfreeze the topic-peak decoder parameters.

Parameters#

modeltorch.nn.Module: scDoRI model containing the topic-peak decoder.
freezebool, optional: If True, freeze the topic-peak decoder; if False, unfreeze it. Default is True.

scdori.train_grn.set_topic_tf_frozen(model, freeze=True)[source]#

Freeze or unfreeze the topic-TF decoder parameters.

Parameters#

modeltorch.nn.Module: scDoRI model containing the topic-TF decoder.
freezebool, optional: If True, freeze the topic-TF decoder; if False, unfreeze it. Default is True.

scdori.train_grn.train_model_grn(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Train the model in Phase 2 (GRN phase).

In this phase, the model focuses on learning activator and repressor TF-gene links per topic (module 4 of scDoRI). Other modules of the model can be optionally frozen or unfrozen based on the configuration.

Parameters#

modeltorch.nn.Module: The scDoRI model to train.
devicetorch.device: The device (CPU or CUDA) used for PyTorch tensors.
train_loaderDataLoader: DataLoader for the training set.
eval_loaderDataLoader: DataLoader for the validation set, used to check early stopping criteria.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells constituting each input metacell; set to 1 for single-cell data.
tf_indiceslist of int: Indices of TF features in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding for batch information.
config_filepython file: Configuration file for model training.

Returns#

torch.nn.Module: The trained model after the GRN phase completes or early stopping occurs.
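
The interaction between periodic validation and patience-based early stopping can be sketched as follows. This is a generic illustration, assuming patience is counted in epochs since the last improvement; the function name run_with_early_stopping and the loss values are hypothetical, and the actual stopping rule in train_model_grn may differ.

```python
def run_with_early_stopping(val_loss_by_epoch, eval_frequency, patience):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_loss_by_epoch):
        if epoch % eval_frequency != 0:
            continue  # validation is only computed every `eval_frequency` epochs
        if loss < best_loss:
            # Improvement: remember it (a real loop would also save the weights here).
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for at least `patience` epochs: stop training.
            return epoch, best_epoch
    return len(val_loss_by_epoch) - 1, best_epoch


# Hypothetical validation losses, one per epoch.
losses = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 3.9]
stop_epoch, best_epoch = run_with_early_stopping(losses, eval_frequency=1, patience=3)
```

In this sketch eval_frequency and patience correspond to the configuration attributes eval_frequency and grn_val_patience.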

Global configuration for the scDoRI modeling pipeline.

This file defines top-level constants and parameters controlling:

Logging
File paths for data and outputs
Model architecture details (numbers of topics, hidden dimensions)
Training phases and hyperparameters
Loss weighting for different data modalities (ATAC, TF, RNA)
Regularization and early stopping settings
Significance testing cutoffs for TF-gene links
UMAP parameters for visualization

Attributes#

logging_levelint: The Python logging level (e.g., logging.INFO).
data_dirPath: Base directory containing the processed anndata and other precomputed files.
output_subdirstr: Subdirectory name within data_dir for storing or accessing outputs.
rna_metacell_filestr: Name of the H5AD file containing RNA data.
atac_metacell_filestr: Name of the H5AD file containing ATAC data.
batch_colstr: Key for the batch column in the AnnData object.
gene_peak_distance_filestr: Filename for the NumPy array containing gene-peak distances.
insilico_chipseq_act_filestr: Filename for the in silico ChIP-seq activator embeddings.
insilico_chipseq_rep_filestr: Filename for the in silico ChIP-seq repressor embeddings.
random_seedint: Random seed for reproducibility.
batch_size_cellint: Batch size for training.
dim_encoder1int: Dimension of the first encoder layer.
dim_encoder2int: Dimension of the second encoder layer.
num_topicsint: Number of latent topics for scDoRI.
batch_size_cell_predictionint: Batch size when making predictions in eval mode (e.g., forward passes only).
epoch_warmup_1int: Number of epochs to run “warmup_1” (ATAC+TF) before adding RNA.
max_scdori_epochsint: Maximum number of epochs for the scDoRI phase 1 training (modules 1, 2, 3).
max_grn_epochsint: Maximum number of epochs for the GRN training phase (module 4).
update_encoder_in_grnbool: Whether to unfreeze the encoder during the GRN phase.
update_peak_gene_in_grnbool: Whether to unfreeze the peak-gene links in the GRN phase.
update_topic_peak_in_grnbool: Whether to unfreeze the topic-peak links in the GRN phase.
update_topic_tf_in_grnbool: Whether to unfreeze the topic-TF links in the GRN phase.
eval_frequencyint: How often (in epochs) to evaluate validation loss.
phase1_patienceint: Early stopping patience (in epochs) for phase 1 training (modules 1, 2, 3).
grn_val_patienceint: Early stopping patience (in epochs) for the GRN phase.
learning_rate_scdorifloat: Learning rate for scDoRI phase 1 training (modules 1, 2, 3).
learning_rate_grnfloat: Learning rate for the GRN training phase.
weight_atac_phase1float: Loss weight for ATAC reconstruction in Phase 1:warmup_1.
weight_tf_phase1float: Loss weight for TF reconstruction in Phase 1:warmup_1.
weight_rna_phase1float: Loss weight for RNA reconstruction in Phase 1:warmup_1. Set to 0.
weight_rna_grn_phase1float: Loss weight for GRN-based RNA reconstruction in Phase 1:warmup_1. Set to 0.
weight_atac_phase2float: Loss weight for ATAC reconstruction in Phase 1:warmup_2.
weight_tf_phase2float: Loss weight for TF reconstruction in Phase 1:warmup_2.
weight_rna_phase2float: Loss weight for RNA reconstruction in Phase 1:warmup_2.
weight_rna_grn_phase2float: Loss weight for GRN-based RNA reconstruction in Phase 1:warmup_2. Set to 0.
weight_atac_grnfloat: Loss weight for ATAC reconstruction in the GRN phase.
weight_tf_grnfloat: Loss weight for TF reconstruction in the GRN phase.
weight_rna_grnfloat: Loss weight for RNA reconstruction in the GRN phase.
weight_rna_from_grnfloat: Loss weight for the GRN-based RNA branch in the GRN phase.
l1_penalty_topic_tffloat: L1 regularization coefficient on the topic_tf_decoder.
l2_penalty_topic_tffloat: L2 regularization coefficient on the topic_tf_decoder.
l1_penalty_topic_peakfloat: L1 regularization coefficient on the topic_peak_decoder.
l2_penalty_topic_peakfloat: L2 regularization coefficient on the topic_peak_decoder.
l1_penalty_gene_peakfloat: L1 regularization coefficient on the gene_peak_factor_learnt.
l2_penalty_gene_peakfloat: L2 regularization coefficient on the gene_peak_factor_learnt.
l1_penalty_grn_activatorfloat: L1 regularization on GRN activator parameters (tf_gene_topic_activator_grn).
l1_penalty_grn_repressorfloat: L1 regularization on GRN repressor parameters (tf_gene_topic_repressor_grn).
tf_expression_modestr: Either “True” (use actual TF expression) or “latent” (model’s predicted TF expression).
tf_expression_clampfloat: Clamping threshold for TF expression values in [0, 1].
cells_per_topicint: Number of cells sampled per topic to compute topic-level TF expression.
weights_folder_scdoristr: Folder to save model weights after the scDoRI Phase 1.
weights_folder_grnstr: Folder to save model weights after the GRN phase.
best_scdori_model_pathstr: Filename for saving the best scDoRI model (Phase 1).
best_grn_model_pathstr: Filename for saving the best GRN model.
umap_n_neighborsint: Number of neighbors for UMAP.
umap_min_distfloat: Min dist parameter for UMAP.
umap_random_stateint: Random seed for UMAP.
significance_cutoffslist of float: List of thresholds for empirical p-value cutoffs in TF-gene link permutation tests.
num_permutationsint: Number of permutations used to compute TF-gene link significance.

scdori.models.initialize_scdori_parameters(model, gene_peak_distance_exp: Tensor, gene_peak_fixed: Tensor, insilico_act: Tensor, insilico_rep: Tensor, phase='warmup')[source]#

Initialize or freeze certain scDoRI parameters, preparing for either warmup or GRN phases.

Parameters#

modeltorch.nn.Module: An instance of the scDoRI model.
gene_peak_distance_exptorch.Tensor: Shape (num_genes, num_peaks). Peak-gene distance matrix, usually an exponential decay.
gene_peak_fixedtorch.Tensor: Shape (num_genes, num_peaks). A binary mask indicating allowable gene-peak links.
insilico_acttorch.Tensor: Shape (num_peaks, num_tfs). In silico ChIP-seq matrix for activators.
insilico_reptorch.Tensor: Shape (num_peaks, num_tfs). In silico ChIP-seq matrix for repressors.
phasestr, optional: “warmup” or “grn”. In “warmup”, sets gene-peak and TF-binding matrices, and keeps them fixed or partially trainable. In “grn”, enables TF-gene parameters to be trainable.

Returns#

None: Modifies model in place, setting appropriate .data values and .requires_grad booleans.

class scdori.models.scDoRI(device, num_genes, num_peaks, num_tfs, num_topics, num_batches, dim_encoder1, dim_encoder2, batch_norm=True)[source]#

Bases: Module

The scDoRI model integrates single cell multi-ome RNA and ATAC data to learn latent topic representations and perform gene regulatory network (GRN) inference.

This model contains:

Encoders for RNA and ATAC, producing a shared topic distribution.
Decoders for ATAC, TF, and RNA reconstruction.
GRN logic for combining TF binding data with gene-peak links and TF expression to reconstruct RNA profiles.

Parameters#

devicetorch.device: The device (CPU or CUDA) for PyTorch operations.
num_genesint: Number of genes in the RNA data.
num_peaksint: Number of peaks in the ATAC data.
num_tfsint: Number of transcription factors being modeled.
num_topicsint: Number of latent topics or factors.
num_batchesint: Number of distinct batches (for batch correction).
dim_encoder1int: Dimension of the first encoder layer.
dim_encoder2int: Dimension of the second encoder layer.
batch_normbool, optional: If True, use batch normalization in encoder and library factor MLPs. Default is True.

Attributes#

encoder_rnatorch.nn.Sequential: The neural network layers for the RNA encoder.
encoder_atactorch.nn.Sequential: The neural network layers for the ATAC encoder.
mu_thetatorch.nn.Linear: Linear layer converting combined RNA+ATAC encoder outputs into raw topic logits.
topic_peak_decodertorch.nn.Parameter: A (num_topics x num_peaks) parameter for ATAC reconstruction.
atac_batch_factortorch.nn.Parameter: A (num_batches x num_peaks) parameter for batch effects in ATAC.
atac_batch_normtorch.nn.BatchNorm1d: Batch normalization layer for ATAC predictions.
topic_tf_decodertorch.nn.Parameter: A (num_topics x num_tfs) parameter for TF expression reconstruction.
tf_batch_factortorch.nn.Parameter: A (num_batches x num_tfs) parameter for batch effects in TF reconstruction.
tf_batch_normtorch.nn.BatchNorm1d: Batch normalization layer for TF predictions.
tf_alpha_nbtorch.nn.Parameter: A (1 x num_tfs) parameter for TF negative binomial overdispersion.
gene_peak_factor_learnttorch.nn.Parameter: A (num_genes x num_peaks) learned matrix linking peaks to genes.
gene_peak_factor_fixedtorch.nn.Parameter: A (num_genes x num_peaks) fixed mask for feasible gene-peak links.
rna_batch_factortorch.nn.Parameter: A (num_batches x num_genes) parameter for batch effects in RNA reconstruction.
rna_batch_normtorch.nn.BatchNorm1d: Batch normalization layer for RNA predictions.
rna_alpha_nbtorch.nn.Parameter: A (1 x num_genes) parameter for RNA negative binomial overdispersion.
tf_library_factortorch.nn.Sequential: An MLP to predict library scaling factor for TF data from the observed TF expression.
rna_library_factortorch.nn.Sequential: An MLP to predict library scaling factor for RNA data from the observed gene counts.
tf_binding_matrix_activatortorch.nn.Parameter: A (num_peaks x num_tfs) matrix of in silico ChIP-seq (activator) TF-peak binding; precomputed and fixed.
tf_binding_matrix_repressortorch.nn.Parameter: A (num_peaks x num_tfs) matrix of in silico ChIP-seq (repressor) TF-peak binding; precomputed and fixed.
tf_gene_topic_activator_grntorch.nn.Parameter: A (num_topics x num_tfs x num_genes) matrix capturing per-topic activator regulation.
tf_gene_topic_repressor_grntorch.nn.Parameter: A (num_topics x num_tfs x num_genes) matrix capturing per-topic repressor regulation.
rna_grn_batch_factortorch.nn.Parameter: A (num_batches x num_genes) batch-effect parameter for the GRN-based RNA reconstruction (module 4).
rna_grn_batch_normtorch.nn.BatchNorm1d: Batch normalization layer for GRN-based RNA predictions.

encode(rna_input, atac_input, log_lib_rna, log_lib_atac, num_cells)[source]#

Encode RNA and ATAC input into a topic distribution (theta).

Parameters#

rna_inputtorch.Tensor: A (B, num_genes) tensor of RNA counts per cell.
atac_inputtorch.Tensor: A (B, num_peaks) tensor of ATAC counts per cell.
log_lib_rnatorch.Tensor: A (B, 1) tensor of log RNA library sizes.
log_lib_atactorch.Tensor: A (B, 1) tensor of log ATAC library sizes.
num_cellstorch.Tensor: A (B, 1) tensor representing how many cells are aggregated (if metacells), or all ones for single-cell data.

Returns#

(theta, mu_theta)tuple of torch.Tensor: theta : (B, num_topics), softmaxed topic distribution. mu_theta : (B, num_topics), raw topic logits.
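
The relationship between the two returned tensors can be sketched in plain Python, assuming theta is a row-wise softmax over the logits mu_theta. The helper softmax and the toy logits are hypothetical; the model itself operates on torch tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of topic logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits (hypothetical): B = 2 cells, num_topics = 3.
mu_theta = [[2.0, 0.0, -2.0], [0.0, 0.0, 0.0]]
theta = [softmax(row) for row in mu_theta]
```

Each row of theta is non-negative and sums to one, so it can be read as a distribution over topics for that cell.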

forward(rna_input, atac_input, tf_input, topic_tf_input, log_lib_rna, log_lib_atac, num_cells, batch_onehot, phase='warmup_1')[source]#

Forward pass through scDoRI, producing predictions for ATAC, TF, and RNA reconstructions (Phase 1), as well as GRN-based RNA predictions in the GRN phase (Phase 2).

Parameters#

rna_inputtorch.Tensor: Shape (B, num_genes). RNA counts per cell in the batch.
atac_inputtorch.Tensor: Shape (B, num_peaks). ATAC counts per cell in the batch.
tf_inputtorch.Tensor: Shape (B, num_tfs). Observed TF expression.
topic_tf_inputtorch.Tensor: Shape (num_topics, num_tfs). TF expression aggregated by topic, used only if phase == “grn”.
log_lib_rnatorch.Tensor: Shape (B, 1). Log of RNA library sizes.
log_lib_atactorch.Tensor: Shape (B, 1). Log of ATAC library sizes.
num_cellstorch.Tensor: Shape (B, 1). Number of cells aggregated (if metacells), else ones.
batch_onehottorch.Tensor: Shape (B, num_batches). One-hot batch encoding for each cell.
phasestr, optional: Which training phase: “warmup_1”, “warmup_2”, or “grn”. If phase == “grn”, the GRN-based RNA predictions are included.

Returns#

dict: A dictionary with the following keys:

“theta”: (B, num_topics), the softmaxed topic distribution.
“mu_theta”: (B, num_topics), raw topic logits.
“preds_atac”: (B, num_peaks), predicted peak accessibility.
“preds_tf”: (B, num_tfs), predicted TF expression.
“mu_nb_tf”: (B, num_tfs), TF negative binomial mean = preds_tf * TF library factor.
“preds_rna”: (B, num_genes), predicted RNA expression.
“mu_nb_rna”: (B, num_genes), RNA negative binomial mean = preds_rna * RNA library factor.
“preds_rna_from_grn”: (B, num_genes), optional GRN-based RNA predictions.
“mu_nb_rna_grn”: (B, num_genes), negative binomial mean of GRN-based RNA predictions.
“library_factor_tf”: (B, 1), predicted library factor for TF.
“library_factor_rna”: (B, 1), predicted library factor for RNA.

scdori.main.run_scdori_pipeline()[source]#

Run the scDoRI pipeline in three main phases: 1) ATAC+TF warmup (phase 1 warmup), 2) Add RNA (phase 1 full), 3) GRN training (phase 2).

Steps#

Configure logging, set random seed, determine computing device.
Load data: RNA/ATAC AnnData, gene-peak distances, in silico ChIP-seq embeddings.
Split cells into train and eval sets, create DataLoaders.
Build and initialize the scDoRI model: configure it with the number of genes, peaks, TFs, and topics, then initialize parameters (gene-peak, in silico matrices, etc.).
Train phases 1 & 2 (integrated ATAC + TF, then add RNA).
Save model weights.
Re-initialize GRN-related parameters and run phase 3 (GRN training).
Save final model weights for the GRN phase.

Returns#

None: The pipeline executes end-to-end training of the scDoRI model, saving intermediate and final weights to disk as specified in config.

Notes#

This function relies on configuration settings in config.py.
The pipeline uses train_scdori_phases for phases 1 & 2, and train_model_grn for the GRN phase.
Outputs (model weights) are saved to the paths specified by config.weights_folder_scdori and config.weights_folder_grn.

scdori.utils.log_nb_positive(x, mu, theta, eps: float = 1e-08, log_fn: callable = <built-in method log of type object>, lgamma_fn: callable = <built-in method lgamma of type object>)[source]#

Compute the log-likelihood for a Negative Binomial (NB) distribution.

This function is often used for modeling overdispersed count data in scRNA-seq.

Parameters#

xtorch.Tensor: Observed count data, shape (batch_size, num_features).
mutorch.Tensor: Mean of the negative binomial, must be > 0. Same shape as x.
thetatorch.Tensor: Inverse-dispersion parameter (smaller values mean stronger overdispersion), must be > 0. Same shape as x.
epsfloat, optional: A small constant for numerical stability in logarithms. Default is 1e-8.
log_fncallable, optional: A function to take the logarithm, typically torch.log. Default is torch.log.
lgamma_fncallable, optional: A function for computing log-gamma, typically torch.lgamma. Default is torch.lgamma.

Returns#

torch.Tensor: Element-wise log-likelihood of shape (batch_size, num_features).
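
For reference, the negative binomial log-likelihood in the mean and inverse-dispersion parameterization is commonly written as shown below. This is a scalar, plain-Python sketch that assumes log_nb_positive follows this standard form; the helper name nb_log_likelihood is hypothetical, and math.log and math.lgamma stand in for the torch functions.

```python
import math


def nb_log_likelihood(x, mu, theta, eps=1e-8):
    """log NB(x; mu, theta) with mean mu and inverse dispersion theta.

    log p = lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * log(theta / (theta + mu))
            + x * log(mu / (theta + mu))
    """
    log_theta_mu = math.log(theta + mu + eps)
    return (
        theta * (math.log(theta + eps) - log_theta_mu)
        + x * (math.log(mu + eps) - log_theta_mu)
        + math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
    )


# Sanity check on toy values: the probabilities over all counts should sum to ~1.
total_prob = sum(math.exp(nb_log_likelihood(x, mu=2.0, theta=3.0)) for x in range(200))
```

The eps terms only guard the logarithms against zero arguments; they shift the result by a negligible amount for strictly positive mu and theta.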

scdori.utils.set_seed(seed=200)[source]#

Set the random seed for Python, NumPy, and PyTorch (including CUDA if available).

Parameters#

seedint, optional: The desired random seed. Default is 200.

Returns#

None: Modifies global states of Python, NumPy, and PyTorch seeds in place.

Notes#

Useful for ensuring reproducible results across runs when training or testing the model. However, full reproducibility can still be subject to GPU hardware determinism settings.

scdori.train_scdori.compute_eval_loss_scdori(model, device, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Compute the validation loss for scDoRI.

Parameters#

modeltorch.nn.Module: The scDoRI model to evaluate.
devicetorch.device: The device (CPU or CUDA) used for PyTorch operations.
eval_loadertorch.utils.data.DataLoader: A DataLoader providing validation cell indices.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells per row (if metacells) or ones for single-cell data.
tf_indiceslist or np.ndarray: Indices of transcription factor genes in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding for batch information (cell x num_batches).
config_fileobject: Configuration object with hyperparameters (loss weights, penalties, etc.).

Returns#

tuple: (eval_loss, eval_loss_atac, eval_loss_tf, eval_loss_rna), each a float.

scdori.train_scdori.get_loss_weights_scdori(phase, config_file)[source]#

Get the loss weight dictionary for the specified phase.

Parameters#

phasestr: The phase of training, one of {“warmup_1”, “warmup_2”}.
config_fileobject: Configuration object containing attributes like weight_atac_phase1, weight_tf_phase1, weight_rna_phase1, etc.

Returns#

dict: A dictionary with keys {“atac”, “tf”, “rna”} indicating the respective loss weights.
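
A minimal sketch of such a lookup, assuming “warmup_1” maps to the *_phase1 weights and “warmup_2” to the *_phase2 weights as the configuration attributes suggest. The function name loss_weights_for_phase and the numeric values are hypothetical placeholders (only weight_rna_phase1 = 0 follows the configuration notes).

```python
from types import SimpleNamespace


def loss_weights_for_phase(phase, cfg):
    """Collect the ATAC, TF and RNA loss weights for one warmup phase."""
    suffix = {"warmup_1": "phase1", "warmup_2": "phase2"}[phase]
    return {name: getattr(cfg, f"weight_{name}_{suffix}") for name in ("atac", "tf", "rna")}


# Stand-in for the configuration object; values are placeholders.
cfg = SimpleNamespace(
    weight_atac_phase1=1.0, weight_tf_phase1=1.0, weight_rna_phase1=0.0,
    weight_atac_phase2=1.0, weight_tf_phase2=1.0, weight_rna_phase2=1.0,
)
weights_warmup_1 = loss_weights_for_phase("warmup_1", cfg)
weights_warmup_2 = loss_weights_for_phase("warmup_2", cfg)
```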

scdori.train_scdori.get_phase_scdori(epoch, config_file)[source]#

Determine which training phase to use at a given epoch. In warmup_1, only modules 1 and 3 (ATAC and TF reconstruction) are trained; RNA reconstruction from ATAC is added in warmup_2.

Parameters#

epochint: The current training epoch.
config_fileobject: Configuration object that includes epoch_warmup_1 to define the cutoff for switching from phase “warmup_1” to “warmup_2”.

Returns#

str: The phase: “warmup_1” if epoch < config_file.epoch_warmup_1, else “warmup_2”.
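
The rule stated in the return value can be written out directly. The sketch below uses a hypothetical function name, phase_for_epoch, and a placeholder cutoff of 3 epochs; only the comparison against epoch_warmup_1 comes from the description above.

```python
from types import SimpleNamespace


def phase_for_epoch(epoch, cfg):
    """Return "warmup_1" before the cutoff epoch and "warmup_2" from the cutoff onward."""
    return "warmup_1" if epoch < cfg.epoch_warmup_1 else "warmup_2"


cfg = SimpleNamespace(epoch_warmup_1=3)  # placeholder value
schedule = [phase_for_epoch(epoch, cfg) for epoch in range(5)]
```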

scdori.train_scdori.train_scdori_phases(model, device, train_loader, eval_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot, config_file)[source]#

Train the scDoRI model in two warmup phases: 1) Warmup Phase 1 (ATAC + TF focus). 2) Warmup Phase 2 (adding RNA).

Includes early stopping based on validation performance.

Parameters#

modeltorch.nn.Module: The scDoRI model to be trained.
devicetorch.device: The device (CPU or CUDA) for running PyTorch operations.
train_loadertorch.utils.data.DataLoader: DataLoader for the training set, providing cell indices.
eval_loadertorch.utils.data.DataLoader: DataLoader for the validation set, providing cell indices.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells per row (metacells) or ones for single-cell data.
tf_indiceslist or np.ndarray: Indices of transcription factor genes in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding matrix for batch information (cells x num_batches).
config_fileobject: Configuration with hyperparameters including:

learning_rate_scdori
max_scdori_epochs
epoch_warmup_1
weight_atac_phase1, weight_tf_phase1, weight_rna_phase1
weight_atac_phase2, weight_tf_phase2, weight_rna_phase2
l1_penalty_topic_tf, etc.
eval_frequency
phase1_patience (early stopping patience for validation loss)

Returns#

torch.nn.Module: The trained scDoRI model after both warmup phases.

scdori.evaluation.get_latent_topics(model, device, data_loader, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot)[source]#

Extract the softmaxed topic activations (theta) for each cell in the dataset.

Parameters#

modeltorch.nn.Module: The scDoRI model containing an encoder for generating topic distributions.
devicetorch.device: The PyTorch device (e.g., ‘cpu’ or ‘cuda’) used for computations.
data_loadertorch.utils.data.DataLoader: A DataLoader that yields batches of cell indices.
rna_anndataanndata.AnnData: RNA single-cell data in AnnData format.
atac_anndataanndata.AnnData: ATAC single-cell data in AnnData format.
num_cellsnp.ndarray: Number of cells in each row (e.g., if using metacells). Set to ones for single-cell data.
tf_indicesnp.ndarray: Indices of transcription factor genes in the RNA data.
encoding_batch_onehotnp.ndarray: One-hot encoding of batch information (cells x num_batches).

Returns#

np.ndarray: A 2D NumPy array of shape (n_cells, n_topics) representing the softmaxed topic activations for each cell in the order given by the DataLoader.

scdori.downstream.compute_activator_tf_activity_per_cell(grn_final, tf_names, latent_all_torch, selected_topics=None, clamp_value=1e-08, zscore=True)[source]#

Compute per-cell activity of activator TFs.

Parameters#

grn_finalnp.ndarray or torch.Tensor: Activator GRN of shape (num_topics, num_tfs, num_genes).
tf_nameslist of str: List of TF names, length = num_tfs.
latent_all_torchnp.ndarray or torch.Tensor: scDoRI latent topic activity of shape (num_cells, num_topics).
selected_topicslist of int, optional: Which topics to analyze. If None, all topics are used.
clamp_valuefloat, optional: Small constant to avoid division by zero. Default is 1e-8.
zscorebool, optional: If True, apply z-score normalization across cells in the final matrix. Default is True.

Returns#

np.ndarray: A (num_cells, num_tfs) array of TF activity values.

scdori.downstream.compute_atac_grn_activator_with_significance(model, device, cutoff_val, outdir)[source]#

Compute significant ATAC-derived TF–gene links for activators with permutation-based significance.

Uses only the learned peak-gene links and in silico ChIP-seq activator matrices. Significance is computed by permuting TF-binding profiles on peaks.

Parameters#

modeltorch.nn.Module: The trained model containing peak and TF decoders.
devicetorch.device: The device (CPU or CUDA) for PyTorch operations.
cutoff_valfloat: Significance cutoff (e.g., 0.95) for the percentile filtering.
outdirstr: Directory to save the computed GRN results.

Returns#

np.ndarray: A (num_topics, num_tfs, num_genes) array of significant ATAC-derived activator GRNs.
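
The general idea of permutation-based filtering can be illustrated on a single TF-gene pair. This is a generic sketch, not the scoring used by this function: the helper names, the dot-product score, and the direction of the cutoff comparison are all assumptions, and the toy example enumerates every permutation so the result is exact, whereas the pipeline draws a fixed number of random permutations (num_permutations).

```python
from itertools import permutations


def link_score(peak_gene_weights, tf_binding):
    """Toy TF-gene score: binding summed over peaks, weighted by peak-gene links."""
    return sum(w * b for w, b in zip(peak_gene_weights, tf_binding))


def empirical_percentile(peak_gene_weights, tf_binding):
    """Fraction of permuted binding profiles scoring no higher than the observed one."""
    observed = link_score(peak_gene_weights, tf_binding)
    null_scores = [link_score(peak_gene_weights, perm) for perm in permutations(tf_binding)]
    return sum(score <= observed for score in null_scores) / len(null_scores)


def is_significant(peak_gene_weights, tf_binding, cutoff_val):
    """Keep the link only if its percentile reaches the cutoff (assumed convention)."""
    return empirical_percentile(peak_gene_weights, tf_binding) >= cutoff_val


# The TF binds exactly the peak linked to the gene: no permutation scores higher.
supported = empirical_percentile([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
# The TF binds the one peak that is not linked to the gene: most permutations score higher.
unsupported = empirical_percentile([0.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0])
```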

scdori.downstream.compute_atac_grn_repressor_with_significance(model, device, cutoff_val, outdir)[source]#

Compute significant ATAC-derived TF–gene links for repressors using permutation-based significance.

Uses the learned peak-gene links and in silico ChIP-seq repressor matrices. Significance is computed by permuting TF-binding profiles on peaks.

Parameters#

modeltorch.nn.Module: The trained model containing peak and TF decoders.
devicetorch.device: The device (CPU or CUDA) for PyTorch operations.
cutoff_valfloat: Significance cutoff (e.g., 0.05) for percentile filtering.
outdirstr: Directory to save the computed GRN results.

Returns#

np.ndarray: A (num_topics, num_tfs, num_genes) array of significant ATAC-derived repressor GRNs.

scdori.downstream.compute_neighbors_umap(rna_anndata, rep_key='X_scdori')[source]#

Compute neighbors and UMAP on the specified representation in an AnnData object.

Parameters#

rna_anndataanndata.AnnData: An AnnData object containing single-cell RNA data.
rep_keystr, optional: The key in rna_anndata.obsm that holds the latent representation used for computing UMAP. Default is “X_scdori”.

Returns#

None: Updates rna_anndata in place with neighbor graph and UMAP coordinates.

scdori.downstream.compute_repressor_tf_activity_per_cell(grn_final, tf_names, latent_all_torch, selected_topics=None, clamp_value=1e-08, zscore=True)[source]#

Compute per-cell activity of repressor TFs.

Parameters#

grn_finalnp.ndarray or torch.Tensor: Repressor GRN of shape (num_topics, num_tfs, num_genes).
tf_nameslist of str: List of TF names, length = num_tfs.
latent_all_torchnp.ndarray or torch.Tensor: scDoRI latent topic activity of shape (num_cells, num_topics).
selected_topicslist of int, optional: Which topics to analyze. If None, all topics are used.
clamp_valuefloat, optional: Small constant to avoid division by zero. Default is 1e-8.
zscorebool, optional: If True, apply z-score normalization across cells in the final matrix. Default is True.

Returns#

np.ndarray: A (num_cells, num_tfs) array of TF activity values.

scdori.downstream.compute_significant_grn(model, device, cutoff_val_activator, cutoff_val_repressor, tf_normalised, outdir)[source]#

Combine significant ATAC-derived and scDoRI-learned GRN links into final activator and repressor GRNs.

Parameters#

modeltorch.nn.Module: The scDoRI model containing learned TF-gene topic parameters.
devicetorch.device: CPU or CUDA device for PyTorch operations.
cutoff_val_activatorfloat: Significance cutoff used for the activator GRN file.
cutoff_val_repressorfloat: Significance cutoff used for the repressor GRN file.
tf_normalisednp.ndarray or torch.Tensor: A (num_topics x num_tfs, 1) or (num_topics x num_tfs) matrix of normalized TF usage.
outdirstr: Directory containing the ATAC-based GRN files and to save computed results.

Returns#

tuple of np.ndarray

grn_actshape (num_topics, num_tfs, num_genes): Computed activator GRN array.
grn_repshape (num_topics, num_tfs, num_genes): Computed repressor GRN array.

Raises#

FileNotFoundError: If the required ATAC-derived GRN files are missing.

scdori.downstream.compute_topic_gene_matrix(model, device)[source]#

Compute a topic-gene matrix for downstream analysis (e.g., GSEA).

Steps#

Apply softmax to model.topic_peak_decoder => (num_topics, num_peaks).
Min-max normalize each peak across topics.
Multiply by (gene_peak_factor_fixed * gene_peak_factor_learnt).
Apply batch norm and softmax.
Obtain the topic-gene matrix (num_topics, num_genes).

Parameters#

modeltorch.nn.Module: The scDoRI model containing topic_peak_decoder and gene_peak_factor.
devicetorch.device: The device (CPU or CUDA) used for PyTorch operations.

Returns#

np.ndarray: A matrix of shape (num_topics, num_genes) representing topic-gene scores.

scdori.downstream.compute_topic_peak_umap(model, device)[source]#

Compute a UMAP embedding of the topic-peak decoder matrix. Each point on this embedding is a peak.

Steps#

Apply softmax to model.topic_peak_decoder => (num_topics, num_peaks).
Min-max normalize across topics.
Transpose to get (num_peaks, num_topics).
Run UMAP on the resulting matrix to get a (num_peaks, 2) embedding.
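
The first three steps above can be sketched without any dependencies; the UMAP step is omitted. The sketch assumes the softmax runs across peaks within each topic and that a small eps guards the min-max division; both are assumptions, as are the helper names and the toy matrix.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row."""
    shift = max(row)
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def min_max_columns(matrix, eps=1e-8):
    """Rescale every column (peak) to [0, 1] across rows (topics)."""
    out = [row[:] for row in matrix]
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        lo, hi = min(column), max(column)
        for i in range(len(matrix)):
            out[i][j] = (matrix[i][j] - lo) / (hi - lo + eps)
    return out


def transpose(matrix):
    """Swap rows and columns: (num_topics, num_peaks) -> (num_peaks, num_topics)."""
    return [list(col) for col in zip(*matrix)]


# Toy decoder (hypothetical): num_topics = 2, num_peaks = 3.
topic_peak = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0]]
probs = [softmax(row) for row in topic_peak]   # step 1
peak_mat = transpose(min_max_columns(probs))   # steps 2 and 3
```

Each row of peak_mat then describes one peak by its relative weight in every topic, which is the matrix that would be passed to UMAP.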

Parameters#

modeltorch.nn.Module: The scDoRI model containing the topic_peak_decoder.
devicetorch.device: The device (CPU or CUDA) used for PyTorch operations.

Returns#

tuple of (np.ndarray, np.ndarray)

embedding_peaksshape (num_peaks, 2): The UMAP embedding of the peaks.
peak_matshape (num_peaks, num_topics): The min-max normalized topic-peak matrix.

scdori.downstream.get_top_activators_per_topic(grn_final, tf_names, latent_all_torch, selected_topics=None, top_k=10, clamp_value=1e-08, zscore=True, figsize=(25, 10), out_fig=None)[source]#

Identify and plot top activator transcription factors per topic (Topic regulators, TRs).

Parameters#

grn_final (np.ndarray or torch.Tensor): An array of shape (num_topics, num_tfs, num_genes), representing an activator GRN.
tf_names (list of str): List of TF names, length = num_tfs.
latent_all_torch (np.ndarray or torch.Tensor): scDoRI latent topic activity of shape (num_cells, num_topics). Not always used, but can be referenced.
selected_topics (list of int, optional): Which topics to analyze. If None, all topics are used.
top_k (int, optional): Number of top TFs to select per topic. Default is 10.
clamp_value (float, optional): Small cutoff to avoid division by zero. Default is 1e-8.
zscore (bool, optional): If True, apply z-score normalization across topics in the final matrix. Default is True.
figsize (tuple, optional): Size for the Seaborn clustermap. Default is (25, 10).
out_fig (str or Path, optional): If provided, the figure is saved to this path; otherwise it is shown.

Returns#

tuple

df_topic_grn (pd.DataFrame): The final DataFrame of shape (#topics, #TF).
selected_tf (list of str): A sorted list of TFs used in the final clustermap.

scdori.downstream.get_top_repressor_per_topic(grn_final, tf_names, latent_all_torch, selected_topics=None, top_k=5, clamp_value=1e-08, zscore=True, figsize=(25, 10), out_fig=None)[source]#

Identify and plot top repressor transcription factors per topic.

Parameters#

grn_final (np.ndarray or torch.Tensor): An array of shape (num_topics, num_tfs, num_genes), representing a repressor GRN.
tf_names (list of str): List of TF names, length = num_tfs.
latent_all_torch (np.ndarray or torch.Tensor): scDoRI latent topic activity of shape (num_cells, num_topics).
selected_topics (list of int, optional): Which topics to analyze. If None, all topics are used.
top_k (int, optional): Number of top TFs to select per topic. Default is 5.
clamp_value (float, optional): Small cutoff to avoid division by zero. Default is 1e-8.
zscore (bool, optional): If True, apply z-score normalization across topics in the final matrix. Default is True.
figsize (tuple, optional): Size for the Seaborn clustermap. Default is (25, 10).
out_fig (str or Path, optional): If provided, the figure is saved to this path; otherwise it is shown.

Returns#

tuple

df_plot (pd.DataFrame): The final DataFrame of shape (#topics, #TF).
selected_tf (list of str): A sorted list of TFs used in the final clustermap.

scdori.downstream.load_best_model(model, best_model_path, device)[source]#

Load the best model weights from disk into the given model.

Parameters#

model (torch.nn.Module): The model instance into which the weights will be loaded.
best_model_path (str or Path): Path to the file containing the best model weights.
device (torch.device): The device (CPU or CUDA) to which the model will be moved.

Returns#

torch.nn.Module: The same model, now loaded with weights and set to eval mode.

Raises#

FileNotFoundError: If the specified best_model_path does not exist.

scdori.downstream.plot_topic_activation_heatmap(rna_anndata, groupby_key='celltype', aggregation='median')[source]#

Compute aggregated scDoRI latent topic activation across groups, then plot a clustermap.

Parameters#

rna_anndata (anndata.AnnData): An AnnData object containing scDoRI latent factors in obsm["X_scdori"].
groupby_key (str, optional): Column in rna_anndata.obs by which to group cells. Default is "celltype".
aggregation (str, optional): Either "median" or "mean" for aggregating factor values per group. Default is "median".

Returns#

pd.DataFrame: The transposed aggregated DataFrame (topics x groups).

Notes#

Uses a Seaborn clustermap to visualize the aggregated data.
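
The aggregation described above (group cells, reduce each topic per group, then lay the result out as topics x groups) can be illustrated with a small pure-Python sketch. It is only an illustration of the idea under stated assumptions: aggregate_topics is a hypothetical helper, the data are made up, and the sketch does not reproduce the library's plotting code.

```python
from statistics import mean, median


def aggregate_topics(latent, groups, aggregation="median"):
    """Aggregate per-cell topic activations by group.

    Returns a nested dict {topic_index: {group: value}}, i.e. topics x groups.
    """
    reducer = {"median": median, "mean": mean}[aggregation]

    # Collect the latent rows that belong to each group.
    by_group = {}
    for row, group in zip(latent, groups):
        by_group.setdefault(group, []).append(row)

    num_topics = len(latent[0])
    return {
        t: {g: reducer([r[t] for r in rows]) for g, rows in by_group.items()}
        for t in range(num_topics)
    }


# Five cells, two topics, two hypothetical cell types.
latent = [[0.1, 0.9], [0.3, 0.7], [0.8, 0.2], [0.6, 0.4], [0.5, 0.5]]
celltypes = ["T", "T", "B", "B", "B"]
table = aggregate_topics(latent, celltypes, aggregation="median")
print(table[0]["B"])  # 0.6
```

The median is less sensitive than the mean to a few cells with extreme topic activity, which may be why it is the default.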

scdori.downstream.save_regulons(grn_matrix, tf_names, gene_names, num_topics, output_dir, mode='activator')[source]#

Save regulons (TF-gene links across topics) for each TF based on a given GRN matrix.

Parameters#

grn_matrix (np.ndarray): A GRN matrix of shape (num_topics, num_tfs, num_genes).
tf_names (list of str): List of transcription factor names, length = num_tfs.
gene_names (list of str): List of gene names, length = num_genes.
num_topics (int): Number of topics in the GRN matrix.
output_dir (str): Directory where the regulon files will be saved.
mode (str, optional): "activator" or "repressor", used to name the output subdirectory/files.

Returns#

None: Saves one TSV file per TF in output_dir. Each file holds a matrix of shape (num_topics, num_genes), in which non-zero values represent a TF-gene link.
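
As a rough sketch of the per-TF output described above, the following writes one tab-separated file per TF, with topics as rows and genes as columns. The helper name write_regulons, the subdirectory layout, the filename pattern, and the header row are assumptions for illustration; the actual files written by save_regulons may be named and laid out differently.

```python
import csv
import tempfile
from pathlib import Path


def write_regulons(grn, tf_names, gene_names, output_dir, mode="activator"):
    """Write one TSV per TF: rows are topics, columns are genes."""
    out = Path(output_dir) / mode  # subdirectory naming is an assumption
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for tf_idx, tf in enumerate(tf_names):
        path = out / f"{tf}_{mode}.tsv"  # hypothetical filename pattern
        with path.open("w", newline="") as fh:
            writer = csv.writer(fh, delimiter="\t")
            writer.writerow(["topic", *gene_names])  # header row is an assumption
            for topic_idx, topic_slice in enumerate(grn):
                # grn is indexed as [topic][tf][gene]
                writer.writerow([topic_idx, *topic_slice[tf_idx]])
        paths.append(path)
    return paths


# Toy GRN: 2 topics, 2 TFs, 3 genes.
grn = [
    [[0.0, 0.5, 0.0], [0.2, 0.0, 0.0]],  # topic 0
    [[0.0, 0.0, 0.9], [0.0, 0.0, 0.0]],  # topic 1
]
with tempfile.TemporaryDirectory() as tmp:
    files = write_regulons(grn, ["GATA1", "SOX2"], ["g1", "g2", "g3"], tmp)
    first_rows = files[0].read_text().splitlines()
```

In this toy example the first file covers GATA1, and its non-zero entries mark a link to g2 in topic 0 and to g3 in topic 1.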

scdori.downstream.visualize_downstream_targets(rna_anndata, gene_list, score_name='target_score', layer='log')[source]#

Visualize the average expression of given genes on a UMAP embedding.

Uses scanpy.tl.score_genes to compute a gene score, then plots using scanpy.pl.umap.

Parameters#

rna_anndata (anndata.AnnData): The AnnData object containing RNA data with .obsm["X_umap"].
gene_list (list of str): A list of gene names to score.
score_name (str, optional): Name of the resulting gene score in rna_anndata.obs. Default is "target_score".
layer (str, optional): Which layer to use if needed in score_genes. Default is "log".

Returns#

None: Plots the UMAP colored by the computed gene score.

scdori.dataloader.create_minibatch(device, index_matrix, rna_anndata, atac_anndata, num_cells, tf_indices, encoding_batch_onehot)[source]#

Create a minibatch of required input tensors using integer indices of cells.

Parameters#

device (torch.device): The device (CPU or CUDA) to which the data should be moved.
index_matrix (torch.Tensor): A 1D tensor containing integer indices of the cells in the minibatch.
rna_anndata (anndata.AnnData): AnnData object for RNA data. The .X matrix should contain RNA counts or expression values.
atac_anndata (anndata.AnnData): AnnData object for ATAC data. The .X matrix should contain accessibility counts.
num_cells (np.ndarray): A NumPy array (N x 1) indicating the number of cells represented by each row (if using metacells). For single-cell level data, this may be an array of ones.
tf_indices (np.ndarray): Indices corresponding to transcription factors (TFs) in the RNA AnnData.
encoding_batch_onehot (np.ndarray): A one-hot encoded matrix representing batch information for each cell (cells x num_batches).

Returns#

tuple

A tuple containing:

input_matrix (torch.Tensor): Concatenated RNA and ATAC input of shape (B, g + p), where B is the batch size, g is the number of genes, and p is the number of peaks. Values are floats on the given device.
tf_exp (torch.Tensor): RNA expression values for TFs, shape (B, num_tfs).
library_size_value (torch.Tensor): Log-scale library sizes for RNA and ATAC, shape (B, 2).
num_cells_value (torch.Tensor): Number of cells per row in the minibatch, shape (B, 1).
input_batch (torch.Tensor): One-hot batch encoding, shape (B, num_batches).

Notes#

This function converts sparse arrays to dense if necessary.
ATAC counts are converted from insertion counts to fragment counts by using (x + 1) // 2.
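
The (x + 1) // 2 conversion noted above can be checked on a few values. A common rationale, stated here as an assumption rather than taken from this page, is that each fragment contributes two insertion events, so even counts are halved and odd counts are rounded up. The helper name below is hypothetical.

```python
def insertions_to_fragments(counts):
    """Convert insertion counts to fragment counts using (x + 1) // 2."""
    return [(x + 1) // 2 for x in counts]


fragments = insertions_to_fragments([0, 1, 2, 3, 4])
print(fragments)  # [0, 1, 1, 2, 2]
```

Integer floor division keeps zero counts at zero, so sparsity in the ATAC matrix is preserved.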

scdori.data_io.load_scdori_inputs(config_file)[source]#

Load RNA and ATAC data (.h5ad files), plus the gene-peak distance matrix and the in silico ChIP-seq activator and repressor matrices.

Parameters#

config_file (object): A configuration file containing the attributes:

data_dir (pathlib.Path): The base directory for input data.
output_subdir (str): The subdirectory where output files are located.
rna_metacell_file (str): The filename for the RNA data (single cell or metacell) (H5AD).
atac_metacell_file (str): The filename for the ATAC data (single cell or metacell) (H5AD).
gene_peak_distance_file (str): The filename for the NumPy array with the gene-peak distance matrix.
insilico_chipseq_act_file (str): The filename for the in silico ChIP-seq activator matrix.
insilico_chipseq_rep_file (str): The filename for the in silico ChIP-seq repressor matrix.
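
For illustration, a configuration object exposing the attributes listed above could look like the sketch below. Every value is a hypothetical placeholder, and a types.SimpleNamespace stands in for the configuration file; how load_scdori_inputs combines these attributes into full paths is not specified here.

```python
from pathlib import Path
from types import SimpleNamespace

# All directory and file names below are placeholders, not scDoRI defaults.
config = SimpleNamespace(
    data_dir=Path("data"),
    output_subdir="scdori_outputs",
    rna_metacell_file="rna.h5ad",
    atac_metacell_file="atac.h5ad",
    gene_peak_distance_file="gene_peak_distance.npy",
    insilico_chipseq_act_file="insilico_chipseq_act.npy",
    insilico_chipseq_rep_file="insilico_chipseq_rep.npy",
)

# Attribute names expected according to the parameter description above.
required = [
    "data_dir",
    "output_subdir",
    "rna_metacell_file",
    "atac_metacell_file",
    "gene_peak_distance_file",
    "insilico_chipseq_act_file",
    "insilico_chipseq_rep_file",
]
missing = [name for name in required if not hasattr(config, name)]
print(missing)  # []
```

Checking for missing attributes before loading gives a clearer error than a failure partway through reading the input files.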

Returns#

tuple

A tuple containing:

rna_metacell (anndata.AnnData): RNA data loaded from H5AD.
atac_metacell (anndata.AnnData): ATAC data loaded from H5AD.
gene_peak_dist (torch.Tensor): A tensor of shape (num_genes, num_peaks) representing gene-peak distances.
insilico_act (torch.Tensor): A tensor of shape (num_peaks, num_motifs) for in silico ChIP-seq (activator) embeddings.
insilico_rep (torch.Tensor): A tensor of shape (num_peaks, num_motifs) for in silico ChIP-seq (repressor) embeddings.

scdori.data_io.save_model_weights(model, path: Path, tag: str)[source]#

Save model weights to a specified path with a given tag.

Parameters#

model (torch.nn.Module): The PyTorch model whose state_dict is to be saved.
path (pathlib.Path): The directory path where the weights file will be saved.
tag (str): An identifier to include in the saved filename (e.g., "best_eval").

Returns#

None
