Bulk RNA-seq pipeline

Snakemake-based

Bioinformatics

Pipeline

Automation

Python

A Snakemake-based pipeline for preprocessing, quality control, alignment, quantification, and transformation of bulk RNA-seq data. Designed for flexibility and reproducibility, it supports both single-end and paired-end reads and produces normalized count matrices ready for downstream analysis.

---

🧬 Features

Download and organize raw FASTQ files
Quality control with FastQC and MultiQC
Adapter trimming and deduplication using fastp
Alignment using HISAT2
Gene-level quantification with featureCounts
Aggregation and transformation of expression data (TPM, log2TPM, VST)
Configurable support for single- or paired-end reads
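
As a rough illustration of the TPM and log2TPM transformations listed above, here is a minimal sketch using pandas and numpy. It is not the code in aggregate_counts.py: the function names, the toy count matrix, the gene lengths, and the pseudocount of 1 are all assumptions made for the example. In practice, gene lengths could come from the Length column that featureCounts writes to its output. VST is left out because this README does not describe how it is computed.

```python
import numpy as np
import pandas as pd


def counts_to_tpm(counts, lengths_bp):
    """Convert a genes x samples raw count matrix to TPM (hypothetical helper)."""
    # Reads per kilobase: divide each gene's counts by its length in kb.
    rpk = counts.div(lengths_bp / 1_000, axis=0)
    # Scale each sample (column) so that its values sum to one million.
    return rpk.div(rpk.sum(axis=0), axis=1) * 1_000_000


def log2_tpm(tpm, pseudocount=1.0):
    """Return log2(TPM + pseudocount); the pseudocount keeps zero values finite."""
    return np.log2(tpm + pseudocount)


# Toy data, invented for illustration only.
counts = pd.DataFrame(
    {"sample_A": [100, 200, 0], "sample_B": [50, 50, 100]},
    index=["gene1", "gene2", "gene3"],
)
lengths_bp = pd.Series([1000, 2000, 500], index=counts.index)

tpm = counts_to_tpm(counts, lengths_bp)
log_tpm = log2_tpm(tpm)
print(tpm.round(1))
print(log_tpm.round(3))
```

Each TPM column sums to one million by construction, which makes a quick sanity check on real output. A sample with zero total counts would produce NaN values here and would need separate handling.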

🧰 Requirements

Snakemake
Python ≥ 3.8
External tools: fastqc, fastp, multiqc, hisat2, samtools, featureCounts (Subread)
Python libraries: pandas, numpy (used in aggregate_counts.py)
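
Before launching the workflow, it can help to confirm that the external tools are on PATH. The following is a small sketch, not part of the pipeline itself; the function name find_missing_tools is hypothetical, and the executable names are taken from the requirements list above.

```python
import shutil

# Executable names taken from the requirements list above.
REQUIRED_TOOLS = [
    "snakemake",
    "fastqc",
    "fastp",
    "multiqc",
    "hisat2",
    "samtools",
    "featureCounts",
]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper).

    `which` is injectable so the check can be exercised without the tools installed.
    """
    return [tool for tool in tools if which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```

This only checks that each executable can be located; it does not verify versions.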

📂 Directory Structure

.
├── config.yaml               # Set sample IDs, paths, paired-end flag, etc.
├── Snakefile                 # Main Snakemake workflow
├── aggregate_counts.py       # Aggregates counts and generates TPM, log2TPM, VST matrices
├── input/
│   └── sample_ids.txt        # List of SRR IDs or sample names
├── raw_data/                 # Input FASTQ files
├── trimmed_data/             # Output from fastp
├── QC/
│   ├── raw/                  # FastQC + MultiQC for raw data
│   └── trimmed/              # FastQC + MultiQC for trimmed data
├── alignment/                # Aligned BAM files (sorted)
├── counts/                   # Count matrices (raw, TPM, log2TPM, VST)
└── ref_genomes/
    └── hg38/                 # Contains HISAT2 index and GTF file
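
To show how the layout above and the paired-end flag might fit together, here is a hedged sketch that reads sample IDs and builds the expected FASTQ paths under raw_data/. The file naming ({sample}_1.fastq.gz, {sample}_2.fastq.gz, {sample}.fastq.gz), the "#" comment handling, and both function names are assumptions for illustration; the actual Snakefile may name files differently.

```python
from pathlib import Path


def parse_sample_ids(text):
    """Return non-empty, non-comment lines from the contents of sample_ids.txt."""
    ids = []
    for line in text.splitlines():
        line = line.strip()
        # Skipping '#' comment lines is an assumption, not documented behavior.
        if line and not line.startswith("#"):
            ids.append(line)
    return ids


def fastq_paths(sample, paired_end, raw_dir="raw_data"):
    """Build the expected raw FASTQ paths for one sample (assumed naming scheme)."""
    raw = Path(raw_dir)
    if paired_end:
        return [raw / f"{sample}_1.fastq.gz", raw / f"{sample}_2.fastq.gz"]
    return [raw / f"{sample}.fastq.gz"]


# Placeholder IDs, invented for illustration only.
example_text = "SRR0000001\n\nSRR0000002\n"
for sample in parse_sample_ids(example_text):
    print(sample, [p.as_posix() for p in fastq_paths(sample, paired_end=True)])
```

In a real run, the text would come from input/sample_ids.txt and the paired-end flag from config.yaml.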

🌐 Links

GitHub Repository