Preprocessing Pipeline#

preprocessing_pipeline.data_io.create_dir_if_not_exists(directory: Path) None[source]#

Create a directory (and parent directories) if it does not already exist.

Parameters#

directorypathlib.Path

The directory path to create.

Returns#

None

If the directory already exists, does nothing. Otherwise, it is created along with any necessary parent folders.

preprocessing_pipeline.data_io.load_anndata(data_dir: Path, rna_file: str, atac_file: str) tuple[AnnData, AnnData][source]#

Load RNA and ATAC data from disk into AnnData objects.

Parameters#

data_dirpathlib.Path

Base directory containing the .h5ad files.

rna_filestr

The filename for the RNA data (H5AD).

atac_filestr

The filename for the ATAC data (H5AD).

Returns#

(AnnData, AnnData)

A tuple containing: - data_rna: AnnData

The RNA AnnData object.

  • data_atac: AnnData

    The ATAC AnnData object.

Notes#

Both .h5ad files are expected to be located in data_dir. This function logs the path from which each file is loaded.

preprocessing_pipeline.data_io.save_processed_datasets(data_rna: AnnData, data_atac: AnnData, out_dir: Path) None[source]#

Save processed RNA and ATAC AnnData objects with matching cell order.

Parameters#

data_rnaanndata.AnnData

The RNA AnnData object, potentially subset or processed.

data_atacanndata.AnnData

The ATAC AnnData object, potentially subset or processed.

out_dirpathlib.Path

The directory where the processed .h5ad files will be saved.

Returns#

None

Writes two files, rna_processed.h5ad and atac_processed.h5ad, ensuring both have the same set and order of cells.

Notes#

  • This function intersects the cell indices (obs_names) of data_rna and data_atac to keep only the common cells.

  • The final shapes of the saved AnnData objects are logged.

preprocessing_pipeline.filtering.intersect_cells(data_rna: AnnData, data_atac: AnnData) tuple[AnnData, AnnData][source]#

Keep only the cells that are present in both RNA and ATAC datasets.

Parameters#

data_rnaanndata.AnnData

The RNA single-cell data.

data_atacanndata.AnnData

The ATAC single-cell data.

Returns#

(data_rna_sub, data_atac_sub)tuple of anndata.AnnData

Two AnnData objects that share the same set of cells, in the same order.

Notes#

  • The function finds the intersection of obs_names (cell barcodes) between the two AnnData objects.

  • It logs the new shapes of the intersected RNA and ATAC data.

preprocessing_pipeline.filtering.remove_mitochondrial_genes(data_rna: AnnData, mito_prefix: str = 'mt-') AnnData[source]#

Remove mitochondrial genes from the RNA data based on a gene name prefix.

Parameters#

data_rnaanndata.AnnData

The RNA AnnData containing gene expression counts.

mito_prefixstr, optional

The prefix used to identify mitochondrial genes. Default is “mt-“.

Returns#

anndata.AnnData

A new AnnData object without mitochondrial genes. The original data is not modified in-place.

Notes#

  • Mitochondrial genes are identified by checking if gene names start with mito_prefix (case-insensitive).

  • A boolean column “mt” is added to data_rna.var to indicate whether each gene was marked as mitochondrial. This subset is then removed from the data.

  • Logs the number of removed genes.

preprocessing_pipeline.correlation.compute_in_silico_chipseq(atac_matrix: ndarray, rna_matrix: ndarray, motif_scores: DataFrame, percentile: float = 99.9, n_bg: int = 10000) tuple[ndarray, ndarray][source]#

Compute correlation-based in-silico ChIP-seq embeddings, separating activator and repressor signals.

Steps#

  1. Compute the Pearson correlation for each (peak, TF) pair.

  2. Retain only the positive part as an “activator” correlation and the negative part as a “repressor” correlation.

  3. For each TF, estimate significance thresholds for correlations from background peaks (i.e., peaks with minimal motif scores for each TF).

  4. Multiply thresholded correlations by the motif scores to produce final activator and repressor embeddings.

  5. Apply MinMax scaling to each TF separately.

Parameters#

atac_matrixnp.ndarray

Shape (n_metacells, n_peaks). Matrix of peak accessibility (or insertion counts) where each row is a metacell and each column is a peak.

rna_matrixnp.ndarray

Shape (n_metacells, n_tfs). Matrix of TF expression where each row is a metacell and each column is a TF.

motif_scorespd.DataFrame

Shape (n_peaks, n_tfs). DataFrame containing motif scores for each (peak, TF) pair.

percentilefloat, optional

The percentile threshold used to determine correlation significance (for activators and repressors). Default is 99.9.

n_bgint, optional

Number of peaks per TF used to form the “background” distribution (lowest motif scores). Default is 10000.

Returns#

insilico_chipseq_act_signp.ndarray

Shape (n_peaks, n_tfs). Final activator scores after thresholding correlations and combining with motif scores. Scaled to [0, 1].

insilico_chipseq_rep_signp.ndarray

Shape (n_peaks, n_tfs). Final repressor scores after thresholding correlations and combining with motif scores. Scaled to [-1, 0].

Notes#

  • Positive correlation is interpreted as activator-like; negative correlation is repressor-like.

  • A background distribution is formed from peaks with the smallest motif scores, then used to find correlation thresholds at the requested percentile.

  • The final arrays are MinMax-scaled: * The repressor array is scaled into the range [-1, 0]. * The activator array is scaled into the range [0, 1].

preprocessing_pipeline.utils.compute_gene_peak_distance_matrix(data_atac: DataFrame, data_rna: DataFrame, gene_coordinates_intersect: DataFrame) ndarray[source]#

Compute a distance matrix between peaks (ATAC) and genes (RNA), accounting for strand.

This function:#

  1. Expects the ATAC data (data_atac.var) to have columns “chr”, “start”, “end”, and “peak_name”.

  2. Expects the gene coordinates DataFrame (gene_coordinates_intersect) to have columns

    [“chr_gene”, “start”, “end”, “strand”, “gene”].

  3. For each gene, we compute a distance to each peak by:
    • 0 if the peak midpoint is within gene-body or promoter (5kb upstream of TSS by default),

    e.g., if on the “+” strand and midpoint is between (start - 5000) and end => distance=0. If on the “-” strand and midpoint is between start and (end + 5000) => distance=0. - Otherwise, distance = min(|mid - start|, |mid - end|). - If the chromosome of the gene != chromosome of the peak => distance = -1.

Parameters#

data_atacpd.DataFrame

The ATAC dataset. Its .var should contain “chr”, “start”, “end”, “peak_name”.

data_rnapd.DataFrame

The RNA dataset. Its .var is expected to have gene names as indices or relevant columns.

gene_coordinates_intersectpd.DataFrame

A DataFrame with gene info, columns at least: [“chr_gene”, “start”, “end”, “strand”, “gene”].

Returns#

np.ndarray

A 2D NumPy array of shape (n_genes, n_peaks) representing gene-peak distances. Each row corresponds to a gene (matching data_rna.var.index), and each column corresponds to a peak (matching data_atac.var.index order).

preprocessing_pipeline.utils.create_extended_gene_bed(gtf_df: DataFrame, final_genes: list[str], window_size: int, chrom_sizes_path: Path | None = None) DataFrame[source]#

Create an extended gene BED-like DataFrame by adding a window around each gene.

For each gene in final_genes, take the gene feature from gtf_df and extend the start and end coordinates by window_size. If a chromosome sizes file is provided, clamp the extended coordinates within valid chromosome boundaries.

Parameters#

gtf_dfpd.DataFrame

A DataFrame containing GTF entries (e.g., loaded by gtfparse). Expected columns: [“feature”, “gene_name”, “seqname”, “start”, “end”, “strand”, …].

final_geneslist of str

The list of gene names to be extended.

window_sizeint

The number of base pairs to extend on each side of the gene.

chrom_sizes_pathpathlib.Path, optional

A path to a tab-delimited file of chromosome sizes (e.g., from UCSC), with columns [chrom, size]. If provided, coordinates are clamped to [0, chrom_size].

Returns#

pd.DataFrame

A DataFrame with columns [“chr”, “start_new”, “end_new”, “gene_name”, “strand”] representing the extended regions for each gene.

Notes#

  • Only rows with feature == “gene” are considered in the GTF.

  • If a gene is not found in gtf_df, it is omitted.

  • The chr column is set from the GTF’s “seqname” field.

preprocessing_pipeline.peak_selection.keep_promoters_and_select_hv_peaks(data_atac: AnnData, total_n_peaks: int, cluster_key: str = 'leiden', promoter_col: str = 'is_promoter') AnnData[source]#

Retain all promoter peaks and then select highly variable (HV) peaks among non-promoters.

Steps#

  1. Identify peaks marked as promoters where var[promoter_col] == True.

  2. Keep all promoter peaks unconditionally.

  3. Among non-promoter peaks, select the top (total_n_peaks - #promoters) peaks by standard deviation across clusters.

  4. If the number of promoter peaks alone is >= total_n_peaks, keep all promoters.

  5. Combine the sets of promoter and HV non-promoter peaks.

Parameters#

data_atacanndata.AnnData

An AnnData containing ATAC data, with a boolean promoter column in .var.

total_n_peaksint

The target total number of peaks to keep. May be exceeded if promoter peaks alone surpass this number.

cluster_keystr, optional

Column in data_atac.obs defining cluster labels for HV peak selection. Default is “leiden”.

promoter_colstr, optional

Column in data_atac.var indicating which peaks are promoters. Default is “is_promoter”.

Returns#

anndata.AnnData

A subset of data_atac containing all promoter peaks plus HV non-promoter peaks.

Notes#

  • If promoter_col is missing, falls back to standard HV peak selection (without promoter logic).

  • The standard deviation for HV peaks is computed by select_highly_variable_peaks_by_std.

preprocessing_pipeline.peak_selection.select_highly_variable_peaks_by_std(data_atac: AnnData, n_top_peaks: int, cluster_key: str = 'leiden') AnnData[source]#

Select highly variable peaks based on the standard deviation of peak accessibility across clusters.

This function: 1. Groups cells by cluster_key in data_atac.obs. 2. Computes ATAC fragment counts and calculates mean accessibility

per peak for each cluster.

  1. Computes the standard deviation of each peak’s accessibility across clusters.

  2. Selects the top n_top_peaks peaks with the highest standard deviation.

Parameters#

data_atacanndata.AnnData

An AnnData containing ATAC data. Expects .obs[cluster_key] to exist.

n_top_peaksint

Number of peaks to retain based on the highest standard deviation across clusters.

cluster_keystr, optional

The column in data_atac.obs specifying cluster labels. Defaults to “leiden”.

Returns#

anndata.AnnData

A subset of data_atac containing only the selected peaks. If n_top_peaks >= data_atac.shape[1], returns the original data.

Notes#

  • If cluster_key is missing, a warning is logged and the original AnnData is returned.

  • Transformation (X + 1) // 2 to interpret insertions as fragment counts.

preprocessing_pipeline.motif_scanning.compute_motif_scores(bed_file: Path, fasta_file: Path, pwms_sub: dict, key_to_tf: dict, n_peaks: int, window: int = 500, threshold: float = 0.001) DataFrame[source]#

Compute a motif score matrix (n_peaks x n_TFs) by scanning peak sequences with FIMO (from tangermeme).

Steps#

  1. Read peak coordinates from bed_file.

  2. Extract sequences from fasta_file.

  3. Run FIMO on these sequences using the PWMs in pwms_sub.

  4. Aggregate scores per (peak, motif) combination; rename motifs by their TF name.

  5. Combine into a DataFrame of shape (n_peaks, n_TFs) and scale values to [0, 1].

Parameters#

bed_filepathlib.Path

The BED file containing peak coordinates. Column 4 must be the peak name.

fasta_filepathlib.Path

The path to the reference genome FASTA file.

pwms_subdict

A dictionary of position weight matrices (PWMs), e.g. from a .meme file, keyed by some motif identifier.

key_to_tfdict

A mapping from motif identifiers to transcription factor (TF) names.

n_peaksint

The total number of peaks expected (should match the number of lines in bed_file).

windowint, optional

Window size around the peak to extract sequences for motif scanning. Default is 500.

thresholdfloat, optional

Minimum p-value (or e-value) threshold for motif hits in FIMO. Default is 1e-3.

Returns#

pd.DataFrame

A DataFrame of shape (n_peaks, n_TFs). Rows are indexed by peak names from bed_file, columns are TF names. Values are scaled motif scores in [0, 1].

Notes#

  • If a TF does not appear in any motif hits, its column will be filled with zeros.

  • For each motif name in pwms_sub, we rename it to the corresponding TF from key_to_tf.

  • The final DataFrame is MinMax-scaled by each column (TF).

preprocessing_pipeline.motif_scanning.load_motif_database(motif_path: Path, final_tfs: list[str]) tuple[dict, dict][source]#

Read a .meme motif file and select only the motifs corresponding to TFs of interest.

Parameters#

motif_pathpathlib.Path

The path to the .meme file.

final_tfslist of str

A list of TF names for which we want motifs.

Returns#

(pwms_sub, key_to_tf)tuple of (dict, dict)
pwms_subdict

A dictionary of PWM matrices (motif_name -> PWM) only for the desired TFs.

key_to_tfdict

A mapping from motif_name to TF name (one motif per TF).

Notes#

  • This function expects that each motif key in the meme file can be parsed to extract a TF name. We assume a format like “MOTIF something Tbx5_…”.

preprocessing_pipeline.motif_scanning.run_bedtools_intersect(a_bed: Path, b_bed: Path, out_bed: Path) None[source]#

Run a ‘bedtools intersect’ command to keep intervals in file A that overlap intervals in file B.

Parameters#

a_bedpathlib.Path

Path to the BED file “A”.

b_bedpathlib.Path

Path to the BED file “B”.

out_bedpathlib.Path

Output path for the intersected results.

Returns#

None

Writes a new BED file to out_bed containing only intervals from A that intersect with B at least once (option -u -wa).

preprocessing_pipeline.motif_scanning.simulate_pwm_scoring(pwm_matrix, bed_file, fasta_file, window)[source]#

Simulate PWM scoring for a given motif using random data.

Parameters#

pwm_matrixnp.ndarray

Position Weight Matrix for the motif (rows = positions, columns = nucleotides).

bed_filepathlib.Path or str

The BED file containing peak locations.

fasta_filepathlib.Path or str

Path to the genome FASTA file.

windowint

The window size around peaks to consider for motif scanning.

Returns#

pd.DataFrame

A simulated DataFrame of motif scores with columns [‘motif_name’, ‘sequence_name’, ‘score’].

Notes#

  • In a real scenario, this would scan each sequence (peak ± window) with the PWM and return actual FIMO-like hits.

  • Here, random scores are generated for testing.

preprocessing_pipeline.metacells.create_metacells(data_rna: AnnData, data_atac: AnnData, grouping_key: str = 'leiden', resolution: float = 5.0, batch_key: str = 'sample') tuple[AnnData, AnnData][source]#

Create metacell-level RNA and ATAC AnnData objects by clustering cells and computing mean values per cluster. This function: 1. Normalizes and logs the RNA data, then runs PCA. 2. Uses Harmony integration for batch correction on the PCA embeddings. 3. Clusters the RNA data with Leiden at the specified resolution, storing the

cluster labels in data_rna.obs[grouping_key].

  1. Summarizes RNA expression and ATAC accessibility for each cluster by taking

    the mean of each feature across all cells in that cluster.

Parameters#

data_rnaanndata.AnnData

RNA single-cell data. A layer “counts” is added and re-assigned later. The shape is (n_cells, n_genes).

data_atacanndata.AnnData

ATAC single-cell data with the same set or superset of cell IDs in data_rna.obs_names. The shape is (n_cells, n_peaks).

grouping_keystr, optional

The key in data_rna.obs where the Leiden cluster labels will be stored. Default is “leiden”.

resolutionfloat, optional

The resolution parameter for Leiden clustering. Higher values yield more clusters. Default is 5.0.

batch_keystr, optional

The column in data_rna.obs indicating batch information for Harmony integration. Default is “sample”.

Returns#

(rna_metacell, atac_metacell)tuple of anndata.AnnData
  • rna_metacell : shape (#clusters, n_genes)

  • atac_metacell : shape (#clusters, n_peaks)

The .obs index is set to the cluster labels, and the .var is inherited from the original data_rna/data_atac.

Notes#

  • The function uses scanpy.external.pp.harmony_integrate for batch integration on the PCA representation stored in “X_pca_harmony”.

  • The ATAC data is transformed by (atac_vals + 1) // 2 to interpret insertions as fragment presence, following Martens et al. (2023).

  • Mean values are computed across cells in each cluster for both RNA and ATAC.

  • The original data_rna.X is restored to raw counts at the end.

preprocessing_pipeline.gene_selection.compute_hvgs_and_tfs(data_rna: AnnData, tf_names: list[str], user_genes: list[str] | None = None, user_tfs: list[str] | None = None, num_genes: int = 3000, num_tfs: int = 300, min_cells: int = 20) tuple[AnnData, list[str], list[str]][source]#

Compute sets of Highly Variable Genes (HVGs) and TFs (transcription factors) for scDoRI training.

This function:
  1. Identifies user-specified genes and TFs present in data_rna.

  2. Selects additional TFs by computing HVGs among potential TFs up to num_tfs.

  3. Selects non-TF HVGs up to num_genes (minus any user-specified genes and TFs).

  4. Combines these sets into a final AnnData subset and returns them.

Parameters#

data_rnaanndata.AnnData

The RNA single-cell data from which to select HVGs and TFs.

tf_nameslist of str

A list of all TF names (from a motif database or known TF list).

user_geneslist of str, optional

A list of user-specified genes that must be included in the final set, default is None.

user_tfslist of str, optional

A list of user-specified TFs that must be included in the final set, default is None.

num_genesint, optional

Total number of HVGs (non-TFs) desired. Default is 3000.

num_tfsint, optional

Total number of TFs desired. Default is 300.

min_cellsint, optional

Minimum number of cells in which a gene must be detected (e.g., nonzero) to be considered for HVG selection. Default is 20 (not fully enforced in this code snippet, but typically used with standard HVG filtering).

Returns#

data_rna_processedanndata.AnnData

A subset of the original data containing the selected HVGs and TFs.

final_geneslist of str

The final list of HVGs (non-TFs).

final_tfslist of str

The final list of TFs.

Notes#

  • HVG selection is done by scanpy.pp.highly_variable_genes, using normalized/log1p data.

  • User-provided genes and TFs are included by default, removing them from the HVG candidate pool if they were already selected.

  • TFs are not re-labeled or otherwise changed beyond this classification.

  • The column data_rna_processed.var[“gene_type”] is set to “HVG” or “TF” for each gene.

preprocessing_pipeline.gene_selection.filter_protein_coding_genes(data_rna: AnnData, gtf_df: DataFrame) AnnData[source]#

Retain only protein-coding genes in the RNA AnnData object based on GTF annotations.

Parameters#

data_rnaanndata.AnnData

RNA single-cell data.

gtf_dfpd.DataFrame

A GTF DataFrame (from load_gtf) containing columns like “gene_type”, “gene_name”.

Returns#

anndata.AnnData

The AnnData subset containing only protein-coding genes found in both the original data and the GTF’s “gene_type == ‘protein_coding’”.

preprocessing_pipeline.gene_selection.load_gtf(gtf_path: Path) DataFrame[source]#

Load gene coordinates from a GTF file into a pandas DataFrame using gtfparse.

Parameters#

gtf_pathpathlib.Path

The path to the GTF file (optionally gzipped).

Returns#

pd.DataFrame

A DataFrame containing parsed GTF records. The columns correspond to GTF fields such as “gene_name”, “gene_type”, “start”, “end”, etc.

preprocessing_pipeline.download.download_file(url: str, out_path: Path) None[source]#

Download a file from a given URL and save it locally using the wget command.

Parameters#

urlstr

The URL to download from.

out_pathpathlib.Path

The local file path where the downloaded file will be saved.

Returns#

None

If the file already exists, no download is performed. Otherwise, runs a shell command with wget to fetch the file.

preprocessing_pipeline.download.download_genome_references(genome_dir: Path, species: str, assembly: str, gtf_url: str | None = None, chrom_sizes_url: str | None = None, fasta_url: str | None = None) None[source]#

Download reference genome files (GTF, chrom.sizes, FASTA) for a given species and assembly.

  1. Resolve the URLs from user input or known defaults for recognized combos (mouse/mm10, human/hg38).

  2. Download each file if not already present.

  3. Decompress .gz files for GTF and FASTA.

Parameters#

genome_dirpathlib.Path

Directory where the reference files will be downloaded.

speciesstr

Species name (e.g. “mouse”, “human”).

assemblystr

Genome assembly version (e.g. “mm10”, “hg38”).

gtf_urlstr, optional

GTF file URL to override defaults. If None, use known default (if available).

chrom_sizes_urlstr, optional

Chromosome sizes file URL to override defaults. If None, use known default (if available).

fasta_urlstr, optional

FASTA file URL to override defaults. If None, use known default (if available).

Returns#

None

Files are downloaded into genome_dir. If a file already exists, no new download occurs.

preprocessing_pipeline.download.resolve_genome_urls(species: str, assembly: str, gtf_url: str, chrom_sizes_url: str, fasta_url: str) tuple[str, str, str][source]#

Resolve final URLs for GTF, chromosome sizes, and FASTA files based on species and assembly.

  1. If user provides URLs (gtf_url, chrom_sizes_url, fasta_url), use them directly.

  2. If not, attempt to use known defaults for recognized combos (e.g. mouse/mm10, human/hg38).

  3. If the combo is unrecognized and the user hasn’t provided all URLs, raise an error.

Parameters#

speciesstr

Species name (e.g. “mouse”, “human”).

assemblystr

Genome assembly (e.g. “mm10”, “hg38”).

gtf_urlstr or None

A user-provided URL for the GTF file, or None to try defaults.

chrom_sizes_urlstr or None

A user-provided URL for the chromosome sizes file, or None to try defaults.

fasta_urlstr or None

A user-provided URL for the FASTA file, or None to try defaults.

Returns#

(final_gtf_url, final_chrom_sizes_url, final_fasta_url)tuple of str

The resolved URLs for GTF, chromosome sizes, and FASTA. These may come from user-provided values or known defaults if recognized. If no defaults exist and user inputs are missing, raises ValueError.

preprocessing_pipeline.download.unzip_gz(file_path: Path, remove_input: bool = False) None[source]#

Decompress a GZIP file (i.e., .gz) in-place using gzip -d.

Parameters#

file_pathpathlib.Path

Path to the .gz file to be decompressed.

remove_inputbool, optional

If True, delete the original .gz file after successful decompression. Default is False.

Returns#

None

Decompresses the file in-place, leaving a new file without the .gz extension.