Hyperparameter Configuration Guide#
This guide provides a comprehensive reference for configuring feature selection and training parameters when using scDoRI.
Input Feature Selection#
In the preprocessing configuration file (preprocessing_pipeline/config.py), you can define the number of peaks, genes, and transcription factors (TFs) to be used in model training.
Although you may increase the number of input features, your final selection should reflect the available GPU memory. The default configuration supports GPUs with approximately 12–15 GB of memory.
If you are using GPUs with higher memory capacity, you may expand the feature set for finer granularity.
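For orientation, this is roughly what the relevant entries in preprocessing_pipeline/config.py look like. The parameter names and values below are illustrative, not the package's exact defaults; check your installed version of the config:

```python
# preprocessing_pipeline/config.py -- illustrative excerpt.
# Feature counts trade granularity against GPU memory; the defaults
# target GPUs with roughly 12-15 GB of memory.
num_peaks = 90_000   # accessibility peaks (illustrative value)
num_genes = 4_000    # genes (illustrative value)
num_tfs = 300        # transcription factors (illustrative value)
```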
Manual Inclusion of Specific Genes and TFs
You can force the inclusion of certain genes or TFs using the genes_user and tfs_user entries in the config file.
This is particularly useful if you have known regulators or candidate markers from prior work or literature.
Ensure that all TFs in the tfs_user list have corresponding entries in your motif database. TFs without motif matches will be excluded during motif scanning.
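For example (genes_user and tfs_user are the config entries named above; the gene and TF symbols are placeholders):

```python
# Force-include known regulators or candidate markers from prior work.
genes_user = ["Gata1", "Spi1", "Myc"]   # placeholder gene symbols
tfs_user = ["Gata1", "Spi1"]            # each TF must have a motif in the MEME
                                        # database, or it is dropped during scanning
```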
Motif File Settings#
The motif database must be provided in MEME format. The default setup includes cisBP files, but you may use alternative databases, provided they match the same structure.
Motif scanning is executed with FIMO (Grant et al., https://meme-suite.org/meme/doc/fimo.html) via the tangermeme package by Jacob Schreiber (https://tangermeme.readthedocs.io/en/latest/tutorials/Tutorial_D1_FIMO.html).
Use the motif_match_pvalue_threshold parameter to control the stringency of motif matching. A stricter cutoff (lower p-value) yields fewer but more confident matches.
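For example (the threshold name comes from the config; the value shown is illustrative):

```python
# Stringency of FIMO motif matching: a lower p-value threshold yields
# fewer, higher-confidence TF-peak motif hits.
motif_match_pvalue_threshold = 1e-4  # illustrative value; tune for your analysis
```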
TF–Peak Correlation: Empirical Filtering#
scDoRI computes Pearson correlations between TF expression and peak accessibility at the metacell level to enhance TF–peak specificity (in silico ChIP-seq).
Adjust the correlation_percentile parameter to prune weak associations.
While not yet natively supported, advanced users can modify the source code to assign different correlation thresholds to activators (positive correlations) and repressors (negative correlations) for improved regulatory resolution.
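As a rough illustration of both ideas, the sketch below prunes a TF-by-peak correlation matrix at a percentile cutoff, with separate cutoffs for positive and negative correlations. This is an assumption-laden sketch, not scDoRI's internal implementation:

```python
import numpy as np

def prune_tf_peak_correlations(corr, pos_percentile=95, neg_percentile=95):
    """Zero out weak TF-peak correlations.

    corr: (n_tfs, n_peaks) Pearson correlations computed at the metacell level.
    An entry is kept only if it exceeds the given percentile of the positive
    (activator-like) or negative (repressor-like) correlation magnitudes.
    """
    pos, neg = corr[corr > 0], -corr[corr < 0]
    pos_cut = np.percentile(pos, pos_percentile) if pos.size else np.inf
    neg_cut = np.percentile(neg, neg_percentile) if neg.size else np.inf
    return np.where((corr >= pos_cut) | (corr <= -neg_cut), corr, 0.0)
```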
Metacell Construction#
Metacells are formed via high-resolution Leiden clustering performed on Harmony-corrected PCA embeddings.
For robust correlation analysis, aim for at least 50 metacell clusters.
If fewer than 50 clusters are obtained, consider increasing the leiden_resolution parameter in the config.
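Conceptually, this follows the standard scanpy + Harmony recipe sketched below. It is an illustration under assumptions (the batch key, resolution value, and component count are placeholders), not the pipeline's exact code:

```python
import scanpy as sc

# adata: AnnData with per-cell counts and a batch annotation (placeholder key "batch")
sc.pp.pca(adata, n_comps=50)
sc.external.pp.harmony_integrate(adata, key="batch")       # Harmony-corrected PCA
sc.pp.neighbors(adata, use_rep="X_pca_harmony")
sc.tl.leiden(adata, resolution=5.0, key_added="metacell")  # high resolution -> many small clusters

# Aim for >= 50 metacells; otherwise raise leiden_resolution in the config.
print(adata.obs["metacell"].nunique())
```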
Enhancer–Gene Distance Window#
scDoRI links enhancers to genes within a fixed genomic window. By default, this is set to 150 kb upstream and downstream of the gene body for the human genome.
Expanding this window may increase sensitivity but also introduces a higher chance of spurious associations.
Conversely, reducing the window will favor specificity but may exclude distal regulatory elements.
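For intuition, here is a minimal sketch of the window test (coordinates are illustrative, and scDoRI's actual peak–gene linking code may differ):

```python
def peak_in_gene_window(peak_start, peak_end, gene_start, gene_end, window=150_000):
    """True if the peak overlaps [gene_start - window, gene_end + window].

    gene_start/gene_end delimit the gene body; window extends it on both sides
    (default 150 kb, the human-genome default described above). Assumes the
    peak and gene lie on the same chromosome.
    """
    return peak_end >= gene_start - window and peak_start <= gene_end + window

# A peak 100 kb upstream of the gene body is linked; one 200 kb away is not.
print(peak_in_gene_window(399_500, 400_000, 500_000, 520_000))  # True
print(peak_in_gene_window(299_500, 300_000, 500_000, 520_000))  # False
```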
Gene Annotation Management
Default GTF files for hg38 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.basic.annotation.gtf.gz) and mm10 (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.basic.annotation.gtf.gz) are included.
Users can provide their own GTF annotation by setting a URL or local file path in the config file.
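For example (the entry name gtf_file is hypothetical; use the corresponding setting in your config):

```python
# Gene annotation: either a remote URL or a local file path.
gtf_file = ("https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/"
            "release_47/gencode.v47.primary_assembly.basic.annotation.gtf.gz")
# gtf_file = "/data/annotations/custom.gtf.gz"   # local alternative
```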
Model Training Settings (scdori/config.py)#
Choosing the Number of Topics
A recommended starting point is: number of topics = expected number of cell types + 10.
Topics with low activity will be suppressed due to softmax normalization.
Adding more topics enables modeling of fine-grained programs, but increases memory usage as each topic has its own GRN.
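For example (the entry name num_topics is illustrative, not necessarily the exact config key):

```python
expected_cell_types = 20                # e.g., from a prior annotation
num_topics = expected_cell_types + 10   # -> 30 topics as a starting point
```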
Regularisation Strategies
Regularisation is applied using L1/L2 penalties on the following matrices:
- Topic–Peak
- Gene–Peak
- Topic–TF Expression
- GRN: Topic–TF–Gene 3D matrices
Higher regularisation promotes sparsity and easier biological interpretation.
Consider different regularisation strengths for activator and repressor GRNs.
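A minimal PyTorch-style sketch of how such penalties could enter the training loss (matrix names and lambda values are placeholders; this is not scDoRI's exact implementation). Note the separate L1 weights for the activator and repressor GRN tensors, per the note above:

```python
import torch

def regularisation_loss(topic_peak, gene_peak, topic_tf,
                        grn_activator, grn_repressor,
                        l1=1e-4, l2=1e-5, l1_act=1e-4, l1_rep=5e-4):
    """L1/L2 penalties on the decoder matrices and the 3D GRN tensors.

    A stronger L1 weight on the repressor GRN (l1_rep) encourages sparser
    repressive edges; all lambda values here are placeholders.
    """
    reg = torch.tensor(0.0)
    for w in (topic_peak, gene_peak, topic_tf):
        reg = reg + l1 * w.abs().sum() + l2 * (w ** 2).sum()
    reg = reg + l1_act * grn_activator.abs().sum()
    reg = reg + l1_rep * grn_repressor.abs().sum()
    return reg
```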
Epoch Configuration and Training Length
Use the following heuristic, based on the number of training steps, to set the number of epochs for Phase 1:
epochs = 60000 training steps / (number_of_cells / batch_size)
For example, with 30,000 cells and batch_size = 128:
epochs = 60000 / (30000 / 128) = 256 epochs
Set the patience parameter to approximately 5% of the total epoch count, with a minimum of 5. This controls how many epochs training waits for the validation loss to improve before stopping early (early stopping).
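The heuristic translates directly into code (a sketch; 60,000 steps is the Phase 1 budget stated above):

```python
import math

def phase1_epochs(n_cells, batch_size, total_steps=60_000):
    """Epochs needed to reach the Phase 1 step budget."""
    steps_per_epoch = n_cells / batch_size
    return math.ceil(total_steps / steps_per_epoch)

epochs = phase1_epochs(n_cells=30_000, batch_size=128)  # -> 256
patience = max(5, round(0.05 * epochs))                 # -> 13 epochs of patience
```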
Phase 2: GRN Inference and Fine-Tuning
For Phase 2, you can reduce the total training steps to roughly half of the Phase 1 budget (i.e., ~30,000 updates), but train for a minimum of 10 epochs.
Due to the added complexity of the GRN logic (e.g., 3D tensors), wall time per epoch is higher than in Phase 1.
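Under the same heuristic, a sketch of the Phase 2 budget (assuming the halved step count discussed above):

```python
import math

# Phase 2: roughly half the Phase 1 step budget, with a floor of 10 epochs.
steps_phase2 = 30_000
steps_per_epoch = 30_000 / 128           # number_of_cells / batch_size
phase2_epochs = max(10, math.ceil(steps_phase2 / steps_per_epoch))  # -> 128
```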