Bulk RNA-seq pipeline

A Snakemake-based pipeline for preprocessing, quality control, alignment, quantification, and transformation of bulk RNA-seq data. Designed for flexibility and reproducibility, it supports both single-end and paired-end reads and produces normalized count matrices ready for downstream analysis.

Bulk RNA-seq Pipeline Workflow

  • Data Input: sample IDs and FASTQ file organization (raw_data/)
  • Quality Control: FastQC + MultiQC assessment (QC/raw/)
  • Preprocessing: fastp trimming and deduplication (trimmed_data/)
  • Post-Trim QC: validation of preprocessing (QC/trimmed/)
  • Reference: HISAT2 index + GTF annotations (hg38/)
  • Alignment: HISAT2 splice-aware alignment (BAM files)
  • Optimization: SAMtools sorting and indexing (sorted BAM)
  • Quantification: featureCounts gene-level counting (raw counts)
  • Normalization: multi-method transformation (TPM, log2TPM, VST)
  • Analysis Ready: normalized matrices for downstream analysis (final datasets)

🧬 Features

  • Download and organize raw FASTQ files
  • Quality control with FastQC and MultiQC
  • Adapter trimming and deduplication using fastp
  • Splice-aware alignment using HISAT2
  • Gene-level quantification with featureCounts
  • Aggregation and transformation of expression data (TPM, log2TPM, VST)
  • Configurable support for single- or paired-end reads
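
The TPM and log2TPM transformations listed above can be sketched as follows. This is a minimal illustration and not the contents of aggregate_counts.py: the function names, the pseudocount of 1, and the toy gene lengths are assumptions made for the example. VST is left out because it is typically computed with a dedicated package such as DESeq2.

```python
import numpy as np
import pandas as pd


def counts_to_tpm(counts: pd.DataFrame, lengths_bp: pd.Series) -> pd.DataFrame:
    """Convert a genes x samples raw count matrix to TPM (hypothetical helper)."""
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = counts.div(lengths_bp / 1_000, axis=0)
    # Rescale each sample (column) so that its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1_000_000


def log2_tpm(tpm: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """log2(TPM + pseudocount); the pseudocount of 1 is an assumption."""
    return np.log2(tpm + pseudocount)


# Toy data: three genes, two samples (values invented for illustration).
counts = pd.DataFrame(
    {"S1": [100, 200, 300], "S2": [0, 50, 150]},
    index=["geneA", "geneB", "geneC"],
)
lengths_bp = pd.Series([1_000, 2_000, 3_000], index=counts.index)

tpm = counts_to_tpm(counts, lengths_bp)
log_tpm = log2_tpm(tpm)
print(tpm.round(1))
print(log_tpm.round(2))
```

Gene lengths would normally come from the same GTF annotation used for counting (featureCounts reports a Length column in its output), so the length vector and the count matrix share one gene index.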

🧰 Requirements

  • Snakemake
  • Python ≥ 3.8
  • External tools: fastqc, fastp, multiqc, hisat2, samtools, featureCounts (Subread)
  • Python libraries: pandas, numpy (used in aggregate_counts.py)
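
Before launching the workflow, it can help to confirm that the external tools above are on the PATH. The snippet below is a small sketch of such a check; it is not part of the pipeline, and the helper name `find_missing_tools` is hypothetical. The executable names follow the list above.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = (
    "snakemake",
    "fastqc",
    "fastp",
    "multiqc",
    "hisat2",
    "samtools",
    "featureCounts",
)


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

This only checks that each executable can be located; it does not verify versions.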

📂 Directory Structure

.
├── config.yaml               # Set sample IDs, paths, paired-end flag, etc.
├── Snakefile                 # Main Snakemake workflow
├── aggregate_counts.py       # Aggregates counts and generates TPM, log2TPM, VST matrices
├── input/
│   └── sample_ids.txt        # List of SRR IDs or sample names
├── raw_data/                 # Input FASTQ files
├── trimmed_data/             # Output from fastp
├── QC/
│   ├── raw/                  # FastQC + MultiQC for raw data
│   └── trimmed/              # FastQC + MultiQC for trimmed data
├── alignment/                # Aligned BAM files (sorted)
├── counts/                   # Count matrices (raw, TPM, log2TPM, VST)
└── ref_genomes/
    └── hg38/                 # Contains HISAT2 index and GTF file
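
To illustrate how the sample list and the paired-end flag relate to the layout above, here is a hedged sketch that maps sample IDs to the FASTQ files expected under raw_data/. The file-naming pattern (`<sample>_1.fastq.gz`, `<sample>_2.fastq.gz`, or `<sample>.fastq.gz`) and both function names are assumptions for this example; the actual Snakefile may use different names.

```python
def parse_sample_ids(text):
    """Return non-empty, whitespace-stripped lines from a sample list."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def expected_fastqs(sample, paired_end, raw_dir="raw_data"):
    """List the FASTQ paths expected for one sample.

    ASSUMPTION: paired-end mates end in _1/_2 and single-end files carry
    no suffix. Adjust the patterns to match the real file names.
    """
    if paired_end:
        return [
            f"{raw_dir}/{sample}_1.fastq.gz",
            f"{raw_dir}/{sample}_2.fastq.gz",
        ]
    return [f"{raw_dir}/{sample}.fastq.gz"]


# Example with placeholder IDs standing in for input/sample_ids.txt.
example_list = "SRR000001\nSRR000002\n\n"
samples = parse_sample_ids(example_list)

for sample in samples:
    print(sample, expected_fastqs(sample, paired_end=True))
```

In practice the text would be read from input/sample_ids.txt and the flag from config.yaml.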