.. _training_hyperparameters: ================================================================== Hyperparameter Configuration Guide ================================================================== This guide provides a comprehensive reference for configuring feature selection and training parameters when using scDoRI. Input Feature Selection -------------------------- In the preprocessing configuration file (`preprocessing_pipeline/config.py`), you can define the number of peaks, genes, and transcription factors (TFs) to be used in model training. - Although you may increase the number of input features, your final selection should reflect the available GPU memory. The default configuration supports GPUs with approximately 12--15 GB of memory. - If you are using GPUs with higher memory capacity, you may expand the feature set for finer granularity. **Manual Inclusion of Specific Genes and TFs** - You can force the inclusion of certain genes or TFs using the `genes_user` and `tfs_user` entries in the config file. - This is particularly useful if you have known regulators or candidate markers from prior work or literature. - Ensure that all TFs in the `tfs_user` list have corresponding entries in your motif database. TFs without motif matches will be excluded during motif scanning. ------------------- Motif File Settings ------------------- - The motif database must be provided in MEME format. The default setup includes cisBP files, but you may use alternative databases, provided they match the same structure. - Motif scanning is executed using **FIMO** (Grant et al., https://meme-suite.org/meme/doc/fimo.html), via the **tangermeme** (https://tangermeme.readthedocs.io/en/latest/tutorials/Tutorial_D1_FIMO.html, author Jacob Schreiber). - Use the `motif_match_pvalue_threshold` parameter to control the stringency of motif matching. A stricter cutoff (lower p-value) yields fewer but more confident matches. -------------------------------------- TF - Peak Correlation: Empirical Filtering -------------------------------------- - scDoRI computes TF expression –-peak acessibility Pearson correlations at the metacell level to enhance TF--peak specificity (insilico-ChIP-seq). - Adjust the `correlation_percentile` parameter to prune weak associations. - While not yet natively supported, advanced users can assign different correlation thresholds for activators (positive) and repressors (negative) in the source code for improved regulatory resolution. ---------------------- Metacell Construction ---------------------- - Metacells are formed via high-resolution Leiden clustering performed on Harmony-corrected PCA embeddings. - For robust correlation analysis, aim for at least 50 metacell clusters. - If fewer than 50 clusters are obtained, consider increasing the `leiden_resolution` parameter in the config. ------------------------------ Enhancer - Gene Distance Window ------------------------------ - scDoRI links enhancers to genes within a fixed genomic window. By default, this is set to 150 kb upstream and downstream of gene-body for the human genome. - Expanding this window may increase sensitivity but also introduces a higher chance of spurious associations. - Conversely, reducing the window will favor specificity but may exclude distal regulatory elements. **Gene Annotation Management** - Default GTF files for `hg38` (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_47/gencode.v47.primary_assembly.basic.annotation.gtf.gz) and `mm10` (https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.basic.annotation.gtf.gz) are included. - Users can provide their own GTF annotation by setting a URL or local file path in the config file. Model Training Settings (`scdori/config.py`) -------------------------------------- **Choosing the Number of Topics** - A recommended starting point is: `#topics = expected number of cell types + 10`. - Topics with low activity will be suppressed due to softmax normalization. - Adding more topics enables modeling of fine-grained programs, but increases memory usage as each topic has its own GRN. **Regularisation Strategies** - Regularisation is applied using L1/L2 penalties on the following matrices: - Topic - Peak - Gene - Peak - Topic - TF Expression - GRN: Topic - TF - Gene 3D matrices - Higher regularisation promotes sparsity and easier biological interpretation. - Consider different regularisation strengths for activator and repressor GRNs. **Epoch Configuration and Training Length** - Use the following heuristic based on number of training steps to set the number of epochs for Phase 1: .. code-block:: text epochs = 60000 training steps / (number_of_cells / batch_size) ### For example, 30,000 cells and batch_size = 128: ### epochs = 60000 / (30000 / 128) = 256 epochs - Set the `patience` parameter to approximately 5% of the total epoch count, but minimum of 5. This is used to monitor the number of epochs to wait for before stopping training when validation loss doesn't improve (early stopping) **Phase 2: GRN Inference and Fine-Tuning** - For Phase 2, you can reduce the total training steps by half (i.e., ~15,000 updates), but minimum of 10 epochs. - Due to added complexity in GRN logic (e.g., 3D tensors), wall time per epoch is higher than phase 1.