Tutorial: Running the Test Dataset
This tutorial walks through running the bundled integration test from scratch. By the end you will have produced real output files and seen every config option and sample-sheet column in use, which makes it straightforward to adapt the pipeline to your own data.
The test dataset is a chr19 subset of GSE130735, a paired-end mouse lung RNA-seq experiment comparing wild-type (WT) and knockout (KO) animals.
1. Set Up the Environment
If you haven't done this yet, follow Getting Started to install Pixi, git-lfs, and the pipeline dependencies.
git clone https://github.com/boulardlab/3t-seq.git
cd 3t-seq
pixi run -e dev setup # pull LFS test data
pixi install # install all tools
Using the dev branch
When cloning the repository, the main branch is checked out by default.
Many of the features described here are available only in the dev branch.
To switch to the dev branch, use:
git checkout dev
2. The Test Data
The integration test lives under .tests/integration/.
.tests/integration/
├── sample-sheet.csv ← the 4-sample manifest
├── configs/
│ ├── local-references.yaml ← config used for the test
│ └── ...
├── profiles/
│ └── laptop/ ← resource settings for local execution
└── GSE130735-subset/ ← chr19-subset FASTQ files (via git-lfs)
The sample sheet
name,filename_1,filename_2,genotype
SRX5795112_SRR9016958,GSE130735-subset/SRX5795112_SRR9016958_1.fq.gz,GSE130735-subset/SRX5795112_SRR9016958_2.fq.gz,WT
SRX5795113_SRR9016959,GSE130735-subset/SRX5795113_SRR9016959_1.fq.gz,GSE130735-subset/SRX5795113_SRR9016959_2.fq.gz,WT
SRX5795117_SRR9016963,GSE130735-subset/SRX5795117_SRR9016963_1.fq.gz,GSE130735-subset/SRX5795117_SRR9016963_2.fq.gz,KO
SRX5795118_SRR9016964,GSE130735-subset/SRX5795118_SRR9016964_1.fq.gz,GSE130735-subset/SRX5795118_SRR9016964_2.fq.gz,KO
Four columns:
- name: a unique sample identifier (required, fixed column name).
- filename_1 / filename_2: paths to the paired-end FASTQ files (required for paired-end, fixed column names). Paths are relative to the Snakemake working directory (set with --directory).
- genotype: a metadata column you name freely. Here it encodes the biological condition used for the differential expression contrast (WT vs KO). The column name must match deseq2.variable in the config (see below).
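The column rules above can be checked before launching a run. The following is a minimal sketch, not part of the pipeline: the function name `check_sample_sheet` and the uniqueness and non-empty checks are assumptions made for illustration, and only two of the four test samples are reproduced.

```python
import csv
import io

# Fixed column names for paired-end data, as described above.
REQUIRED_PE = ("name", "filename_1", "filename_2")


def check_sample_sheet(text, metadata_column="genotype"):
    """Return a list of problems found in a paired-end sample sheet (empty if none).

    Hypothetical helper for illustration; the pipeline may validate differently.
    """
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    problems = [
        f"missing column: {c}"
        for c in (*REQUIRED_PE, metadata_column)
        if c not in columns
    ]
    if problems:
        return problems

    seen = set()
    for row in reader:
        # Sample names must be unique.
        if row["name"] in seen:
            problems.append(f"duplicate sample name: {row['name']}")
        seen.add(row["name"])
        # Every sample needs a value in the metadata column used for the contrast.
        if not row[metadata_column]:
            problems.append(f"empty {metadata_column} for {row['name']}")
    return problems


sheet = """name,filename_1,filename_2,genotype
SRX5795112_SRR9016958,GSE130735-subset/SRX5795112_SRR9016958_1.fq.gz,GSE130735-subset/SRX5795112_SRR9016958_2.fq.gz,WT
SRX5795117_SRR9016963,GSE130735-subset/SRX5795117_SRR9016963_1.fq.gz,GSE130735-subset/SRX5795117_SRR9016963_2.fq.gz,KO
"""
print(check_sample_sheet(sheet))  # an empty list means no problems were found
```

To check your own sheet, read the file contents and pass them in, changing `metadata_column` to whatever column you use for the contrast.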
The config
comparisons:
- name: GSE130735-subset # (1)
protocol: pe # (2)
sample_sheet: sample-sheet.csv # (3)
trimmomatic: "ILLUMINACLIP:..." # (4)
star: "--seedSearchStartLmax 30 ..."
bamCoverage: "--binSize 10 --normalizeUsing None"
deseq2:
test: Wald # (5)
variable: genotype # (6)
reference_level: WT # (7)
disable_TE_analysis: false
disable_salmonTE_analysis: false
disable_tRNA_analysis: false
globals:
results_folder: results/ # (8)
genome:
label: mm10 # (9)
fasta_path: references/GRCm38.primary_assembly.genome.chr19.fa # (10)
gtf_path: references/MGI.chr19.gff3
annotation_type: mgi
selected_chromosomes:
- chr19 # (11)
(1) A label for this comparison, used to name output subdirectories.
(2) pe = paired-end; se = single-end. Must match the columns in your sample sheet.
(3) Path to the sample sheet, relative to the Snakemake working directory.
(4) Fixed Trimmomatic parameters. Omit this key to use the defaults, or set adaptive: true to let the pipeline derive parameters automatically from FastQC output.
(5) DESeq2 test type (Wald or LRT).
(6) Which sample-sheet column to use for the contrast. Must match a column name in the CSV.
(7) The level of variable used as the baseline (denominator) of the fold-change. Here WT is the reference, so positive LFC means higher in KO.
(8) Where all output files will be written.
(9) Genome label. The pipeline downloads references automatically for mm10 and mm39.
(10) For the test we override with local chr19-only files so the run completes quickly.
(11) Restrict analysis to chr19, which drastically speeds up the integration test.
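Notes (6) and (7) tie the config to the sample sheet: the contrast variable must be a column, and the reference level must be one of its values. A minimal sketch of that cross-check follows. The function `check_contrast` is hypothetical, the file names in the example sheet are shortened placeholders, and the config values are passed as plain arguments rather than read from the YAML file.

```python
import csv
import io


def check_contrast(sheet_text, variable, reference_level):
    """Check that `variable` is a column and `reference_level` is one of its values.

    Returns the non-reference levels, i.e. the numerator side of the fold-change.
    Hypothetical helper for illustration only.
    """
    rows = list(csv.DictReader(io.StringIO(sheet_text)))
    if not rows or variable not in rows[0]:
        raise ValueError(f"column {variable!r} not found in sample sheet")
    levels = sorted({row[variable] for row in rows})
    if reference_level not in levels:
        raise ValueError(f"reference level {reference_level!r} not among {levels}")
    return [level for level in levels if level != reference_level]


# Shortened, hypothetical file names; only the column layout matters here.
sheet = (
    "name,filename_1,filename_2,genotype\n"
    "s1,a_1.fq.gz,a_2.fq.gz,WT\n"
    "s2,b_1.fq.gz,b_2.fq.gz,KO\n"
)
print(check_contrast(sheet, variable="genotype", reference_level="WT"))  # ['KO']
```

With WT as the reference, the returned level (KO) is the one whose higher expression shows up as a positive LFC.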
3. Run the Pipeline
From the repository root:
This expands to:
pixi run snakemake \
--profile .tests/integration/profiles/laptop \
--configfile .tests/integration/configs/local-references.yaml \
--directory .tests/integration
- --profile sets resource limits and the executor (local, 4 cores by default).
- --configfile points to the config above.
- --directory sets the working directory, so relative paths in the sample sheet resolve correctly.
Snakemake will print a job plan, then start executing. Expected runtime on a laptop: 10–20 minutes.
Dry run first
See exactly what will be executed without running anything:
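The project's own dry-run command is not reproduced in this text. One way to get the same effect, assuming the expanded invocation shown in this section, is to append Snakemake's `--dry-run` flag (short form `-n`). The sketch below only builds and prints the command line; the helper name `snakemake_command` is hypothetical.

```python
import shlex


def snakemake_command(dry_run=False):
    """Build the expanded Snakemake invocation from this section as an argument list."""
    cmd = [
        "pixi", "run", "snakemake",
        "--profile", ".tests/integration/profiles/laptop",
        "--configfile", ".tests/integration/configs/local-references.yaml",
        "--directory", ".tests/integration",
    ]
    if dry_run:
        # --dry-run makes Snakemake print the job plan without executing any rule.
        cmd.append("--dry-run")
    return cmd


# Print a copy-pasteable shell command for the dry run.
print(shlex.join(snakemake_command(dry_run=True)))
```

To execute it from Python instead of pasting it into a shell, pass the list to `subprocess.run(cmd, check=True)` from the repository root.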
4. Explore the Results
When the run finishes, outputs are in .tests/integration/results/:
results/
├── qc/
│ └── multiqc/GSE130735-subset/multiqc_report.html ← QC overview
├── alignments/
│ └── star/GSE130735-subset/ ← BAM files
├── analysis/
│ ├── tables/GSE130735-subset/ ← count matrices
│ └── pictures/GSE130735-subset/ ← DESeq2 plots
└── tRNA_coverage/
└── GSE130735-subset/ ← tRNA quantification
Open results/qc/multiqc/GSE130735-subset/multiqc_report.html in your browser for a
summary of read counts, mapping rates, and duplication metrics.
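A quick way to confirm the run produced the layout above is to check that each expected path exists. This is a minimal sketch: the list mirrors the tree shown in this section, the individual file names inside each folder are not assumed, and the helper `missing_outputs` is hypothetical.

```python
from pathlib import Path

COMPARISON = "GSE130735-subset"

# Paths taken from the results tree above, relative to the results folder.
EXPECTED = [
    f"qc/multiqc/{COMPARISON}/multiqc_report.html",
    f"alignments/star/{COMPARISON}",
    f"analysis/tables/{COMPARISON}",
    f"analysis/pictures/{COMPARISON}",
    f"tRNA_coverage/{COMPARISON}",
]


def missing_outputs(results_dir):
    """Return the expected paths that do not exist under `results_dir`."""
    root = Path(results_dir)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Run from the repository root; an empty list means every expected path is present.
print(missing_outputs(".tests/integration/results"))
```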
5. Generate a Snakemake Report
This creates a report.zip in .tests/integration/. Unzip it and open report.html for
an interactive provenance report that includes tool versions, rule parameters, and QC plots.
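Unpacking the archive can also be scripted. The sketch below extracts a report zip and returns any report.html it finds. The helper name `extract_report` and the destination folder are assumptions, and the search is recursive because the archive's internal layout is not described here.

```python
import zipfile
from pathlib import Path


def extract_report(zip_path, dest):
    """Extract `zip_path` into `dest` and return the report.html files found there."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # Search recursively, since report.html may sit inside a subfolder of the archive.
    return sorted(Path(dest).rglob("report.html"))


report_zip = Path(".tests/integration/report.zip")
if report_zip.exists():
    # "report_unpacked" is a hypothetical destination folder name.
    for html in extract_report(report_zip, report_zip.parent / "report_unpacked"):
        print(html)
```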
Next Steps
Now that you've run the test, adapt the pipeline to your own data:
- Preparing Data & Samples: how to write your own sample sheet and config.
- Advanced Profiles: resource settings and HPC/Slurm execution.
- Running & Reporting: execution patterns and monitoring.