Sequence Alignment and Trimming Deduplication

To optimize performance and reduce storage usage, the 3t-seq pipeline implements a deduplication mechanism for expensive processing steps, specifically read trimming and sequence alignment.

Overview

The pipeline uses content-based hashing to identify every processing step. Think of a "hash" as a unique barcode or fingerprint that represents one specific combination of input data and settings. Expensive computational work (such as mapping millions of reads to a genome) is therefore performed only once: if two samples produce exactly the same hash for a step, the pipeline treats that step as identical for both and reuses the existing result instead of repeating the work.

The Hashing Mechanism

Before the workflow starts, the pipeline computes a hierarchy of SHA-256 hashes for every sample and records them in the SAMPLE_HASHES registry.

1. Trimming Hash

This is the root of the dependency tree. It is derived from:

Input Files: Absolute paths to the raw FASTQ files.
Processing Parameters: The specific Trimmomatic settings (e.g., adaptive vs. fixed string).
Protocol: Whether the library is Single-End or Paired-End.
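
As a minimal sketch of how the three ingredients above could be combined into one SHA-256 digest, assuming a JSON serialization with sorted keys: the function name `trimming_hash`, the payload layout, and the example paths and Trimmomatic settings are illustrative assumptions, not the pipeline's actual code.

```python
import hashlib
import json


def trimming_hash(fastq_paths, trim_params, paired_end):
    """Hypothetical helper: SHA-256 over the three trimming ingredients."""
    payload = {
        "inputs": list(fastq_paths),            # absolute paths, order kept (R1 before R2)
        "params": trim_params,                  # Trimmomatic settings as one string
        "protocol": "PE" if paired_end else "SE",
    }
    # sort_keys gives a stable serialization, so equal inputs give equal digests
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Example values below are made up for illustration.
reads = ["/data/raw/s1_R1.fastq.gz", "/data/raw/s1_R2.fastq.gz"]
a = trimming_hash(reads, "SLIDINGWINDOW:4:15 MINLEN:36", paired_end=True)
b = trimming_hash(reads, "SLIDINGWINDOW:4:15 MINLEN:36", paired_end=True)
c = trimming_hash(reads, "SLIDINGWINDOW:4:15 MINLEN:50", paired_end=True)
print(a == b, a == c)  # True False
```

The same files with the same settings map to the same hash, while changing a single setting yields a different hash and therefore a separate trimming result.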

2. Alignment Hash

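Because the Alignment Hash includes the Trimming Hash, any change to the trimming inputs or settings also changes the Alignment Hash, so an outdated alignment is never reused. It depends on: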

The Trimming Hash.
The STAR parameters (default or overrides).
The Genome Configuration (labels, FASTA, GTF).
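
The chaining described above can be sketched as follows. The helper names `digest` and `alignment_hash`, the dictionary layout, and the example genome paths are assumptions made for illustration; only the idea of feeding the upstream hash into the downstream one comes from the description.

```python
import hashlib
import json


def digest(payload):
    """SHA-256 hex digest of a JSON-serializable payload (stable key order)."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def alignment_hash(trim_hash, star_params, genome):
    # The upstream Trimming Hash is part of the payload,
    # so any trimming change also changes the result.
    return digest({"trim": trim_hash, "star": star_params, "genome": genome})


# Example values below are made up for illustration.
genome = {"label": "mm10", "fasta": "/ref/mm10.fa", "gtf": "/ref/mm10.gtf"}
star_params = "--outFilterMultimapNmax 100"
trim_v1 = digest({"params": "MINLEN:36"})
trim_v2 = digest({"params": "MINLEN:50"})

align_v1 = alignment_hash(trim_v1, star_params, genome)
align_v2 = alignment_hash(trim_v2, star_params, genome)
align_new_gtf = alignment_hash(trim_v1, star_params, {**genome, "gtf": "/ref/mm10.v2.gtf"})

print(align_v1 != align_v2)       # True: a trimming change propagates downstream
print(align_v1 != align_new_gtf)  # True: a different annotation gives a new hash
```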

3. Analysis Hashes

Higher-level analysis hashes (like starTE or MarkDup) are derived from the upstream Alignment Hash and module-specific settings:

starTE: Alignment Hash + Strandedness + starTE-specific parameters.
MarkDup: Alignment Hash + Picard settings.
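
To illustrate how the analysis hashes branch off one Alignment Hash, here is a hedged sketch of what a registry entry might look like. The name SAMPLE_HASHES comes from this page, but the dictionary structure, the helpers `derive` and `build_entry`, and the example settings are assumptions.

```python
import hashlib
import json


def derive(*parts):
    """SHA-256 over an ordered list of JSON-serializable parts."""
    encoded = json.dumps(parts, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def build_entry(align_hash, strandedness, starte_params, picard_params):
    """Hypothetical registry entry: both analysis hashes hang off one alignment hash."""
    return {
        "align": align_hash,
        "starte": derive("starTE", align_hash, strandedness, starte_params),
        "markdup": derive("MarkDup", align_hash, picard_params),
    }


# Placeholder alignment hash and made-up settings, for illustration only.
align = derive("example-alignment")
SAMPLE_HASHES = {
    "sample1": build_entry(align, "reverse", "", "REMOVE_DUPLICATES=false"),
    "sample2": build_entry(align, "reverse", "", "REMOVE_DUPLICATES=true"),
}

s1, s2 = SAMPLE_HASHES["sample1"], SAMPLE_HASHES["sample2"]
print(s1["starte"] == s2["starte"])    # True: the starTE result can be shared
print(s1["markdup"] == s2["markdup"])  # False: Picard settings differ
```

Under this layout, changing a module-specific setting invalidates only that module's result; the sibling analysis and the upstream alignment stay shared.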

Shared Result Cache (`_shared/`)

Results for hashed tasks are stored in a centralized directory structure:

results/
├── trim/
│   └── _shared/
│       └── <trim_hash>/
│           └── <sample>.fastq.gz
└── alignments/
    └── star/
        └── _shared/
            └── <align_hash>/
                └── <sample>.Aligned.sortedByCoord.out.bam
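
The layout above maps a hash and a sample name to one file location. A small sketch of that mapping, where the function names `shared_trim_path` and `shared_bam_path` and the short example hash are hypothetical:

```python
from pathlib import PurePosixPath

RESULTS = PurePosixPath("results")


def shared_trim_path(trim_hash, sample):
    """Location of a trimmed FASTQ inside the shared cache."""
    return RESULTS / "trim" / "_shared" / trim_hash / f"{sample}.fastq.gz"


def shared_bam_path(align_hash, sample):
    """Location of a sorted BAM inside the shared cache."""
    return (RESULTS / "alignments" / "star" / "_shared" / align_hash
            / f"{sample}.Aligned.sortedByCoord.out.bam")


# "ab12" stands in for a full 64-character SHA-256 digest.
print(shared_trim_path("ab12", "sample1"))
# results/trim/_shared/ab12/sample1.fastq.gz
print(shared_bam_path("ab12", "sample1"))
# results/alignments/star/_shared/ab12/sample1.Aligned.sortedByCoord.out.bam
```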

Preservation of Per-Series Structure

Instead of recalculating results, the pipeline checks whether a result for the corresponding hash already exists in the _shared/ folder. If it does, the pipeline creates a symbolic link from the per-series folder to the shared file, so each series keeps its usual directory layout.

Think of a symbolic link as a shortcut on your desktop, or a "see page 42" reference in a lab notebook. It looks and acts like the actual file, but takes up almost no extra disk space.

results/alignments/star/GSE123456/sample1.bam -> ../_shared/<align_hash>/sample1.bam
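
The check-then-link step can be sketched as below, assuming a POSIX file system. The function `link_to_shared` is a hypothetical helper, not the pipeline's own code, and the demo runs in a temporary directory with a placeholder file instead of a real BAM.

```python
import os
import tempfile
from pathlib import Path


def link_to_shared(shared_file, series_file):
    """Point series_file at shared_file via a relative symlink.

    Returns False on a cache miss (the shared file does not exist yet).
    """
    shared_file, series_file = Path(shared_file), Path(series_file)
    if not shared_file.exists():
        return False                    # cache miss: the step has to run first
    series_file.parent.mkdir(parents=True, exist_ok=True)
    # A relative target keeps the link valid if the results tree is moved as a whole
    target = os.path.relpath(shared_file, start=series_file.parent)
    if series_file.is_symlink():
        series_file.unlink()            # replace a stale link
    series_file.symlink_to(target)
    return True


with tempfile.TemporaryDirectory() as tmp:
    star = Path(tmp) / "results" / "alignments" / "star"
    shared = star / "_shared" / "ab12" / "sample1.bam"   # "ab12" stands in for a real hash
    shared.parent.mkdir(parents=True)
    shared.write_bytes(b"placeholder")
    link = star / "GSE123456" / "sample1.bam"

    linked = link_to_shared(shared, link)
    target = os.readlink(link)
    same_content = link.read_bytes() == shared.read_bytes()

print(linked, target, same_content)
# True ../_shared/ab12/sample1.bam True
```

Reading through the link returns the shared file's content, which is why downstream steps can use the per-series path without knowing about the cache.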

Benefits

Performance: Drastically reduces compute time when analyzing overlapping cohorts or re-running samples.
Storage: Prevents redundant storage of large BAM and FASTQ files.
Reliability: Ensures that identical inputs always produce identical, shared outputs, reducing variability.

Usage Transparency

This mechanism is entirely transparent to the user. No changes to configuration files are required; the pipeline handles routing, hashing, and link creation automatically.