Open-source · Free forever · bioinfocodex.com

Decode biology.
One analysis at a time.

BioInfoCodex is your free, open-source bioinformatics codex — guided tools, step-by-step tutorials, and education resources that make powerful genomics analysis accessible to every researcher, regardless of coding experience.

100%
Free & open-source
15+
Organisms supported
3
Operating systems
40+
Gene families
Who we are

About BioInfoCodex

We are researchers who spent too much time fighting with software and too little time doing biology. BioInfoCodex exists to fix that — one guided tool, one tutorial, one pipeline at a time.

🎯
Our mission

Remove technical barriers between researchers and their data. Every tool includes step-by-step guidance, so your biological question comes first.

🔓
Always free

Every tool, tutorial, and resource on BioInfoCodex is and will remain free and open-source. No subscriptions, no paywalls, no hidden limits.

🌍
For everyone

Designed for beginners with no coding background, and flexible enough for experienced bioinformaticians who want to move faster.

🔬
Research-grade

Built on the same tools used in published research — HISAT2, DESeq2, GATK, samtools — properly configured and ready to use.

The team

People behind BioInfoCodex

Researchers and developers who believe great science software should be accessible to everyone.

👨‍🔬
Matrika Bhattarai
Founder & Lead Developer

Researcher specialising in RNA-seq and functional genomics. Built RNAflow and founded BioInfoCodex to make bioinformatics accessible to every researcher worldwide.

Your name here
Contributor

BioInfoCodex is open-source. If you want to contribute a tool, tutorial, or bug fix, we'd love to have you on the team.

Collaborators welcome
Any expertise

Bioinformaticians, biologists, statisticians, web developers — all contributions welcome. Get in touch via the Contact page.

Roadmap

What's coming

BioInfoCodex is a growing platform. Here's what's in development — new tools and tutorials added regularly.

🧬
RNAflow v1.0
Live now

Complete RNA-seq pipeline with gene family analysis, 15 organisms, Run mode.

🔬
NGSuite
Q3 2025

Whole-genome sequencing, variant calling, SNP/INDEL annotation.

PlasmidMap
Q4 2025

Plasmid design, restriction site analysis, GenBank import/export.

🏔️
ChIPflow
2026

ChIP-seq and ATAC-seq — peak calling, motif analysis, chromatin accessibility.

🌿
MetaFlow
2026

Metagenomics, 16S rRNA, microbiome diversity and differential abundance.

📚
Education portal
In progress

Step-by-step tutorials for every major bioinformatics method.

Software

Bioinfo Tools

Free, guided pipelines for every major bioinformatics workflow. Each tool runs on your own computer — your data never leaves your machine.

✦ Available now
🧬
RNAflow
v1.0 · Released 2025
Live

Complete RNA-seq analysis pipeline — download raw data from NCBI, quality control, adapter trimming, genome alignment, gene counting, DESeq2 differential expression, volcano plots, heatmaps, PCA, and gene family analysis. Works on 15+ organisms including all major crop plants.

RNA-seq · DESeq2 · HISAT2 · FastQC · fastp · featureCounts · ggplot2 · 15 organisms
🍎 macOS  ·  🐧 Linux  ·  🪟 Windows (WSL2)  ·  🐍 Python 3.6+  ·  📊 R 4.0+
In development
🔬
NGSuite
Expected Q3 2025
Soon

Whole-genome and exome sequencing pipeline. Variant calling with GATK, SNP/INDEL annotation, structural variant detection, population genetics.

WGS / WES · GATK4 · BWA-MEM2 · VEP
PlasmidMap
Expected Q4 2025
Soon

Plasmid design and annotation. Import GenBank files, identify restriction sites, annotate features, export circular maps for publication.

Plasmid design · GenBank · Restriction sites · Circular maps
🏔️
ChIPflow
Expected 2026
Soon

ChIP-seq and ATAC-seq pipeline. Peak calling with MACS2, motif analysis with HOMER, differential chromatin accessibility, BigWig tracks.

ChIP-seq · ATAC-seq · MACS2 · HOMER
🌿
MetaFlow
Expected 2026
Soon

Metagenomics and 16S rRNA microbiome pipeline. Taxonomic classification, alpha/beta diversity, differential abundance with DESeq2.

Metagenomics · 16S rRNA · QIIME2 · Kraken2
🔭
scRNAflow
Expected 2026
Soon

Single-cell RNA-seq pipeline using Seurat. Cell clustering, UMAP visualisation, marker gene identification, trajectory analysis.

scRNA-seq · Seurat · UMAP · Cell typing
⚗️
ProteomicsLab
Expected 2026
Soon

Mass spectrometry proteomics. Label-free quantification, TMT/iTRAQ, PTM analysis, pathway enrichment, volcano plots for protein data.

Proteomics · MaxQuant · LFQ · TMT
Education

Learn Bioinformatics

Step-by-step tutorials, database guides, and concept explanations — from downloading your first sequence to running a complete analysis pipeline.

Topics
Getting started
🚀 Introduction to bioinformatics
⚙️ Setting up your environment
🖥️ Using the terminal
Data & databases
🗄️ NCBI — what is it?
⬇️ How to download sequences
📦 Download SRA / RNA-seq data
🌐 Ensembl — genome downloads
📄 File formats explained
RNA-seq analysis
🧬 What is RNA-seq?
🔬 Quality control (FastQC)
🎯 Alignment concepts
📊 DESeq2 — the statistics
Resources
🗂️ Key bioinformatics databases
📖 Glossary
Getting started

Introduction to Bioinformatics

Bioinformatics is the science of using computers to make sense of biological data — DNA sequences, gene expression levels, protein structures, and more. This guide gives you the foundation you need before diving into any analysis.

1. What is bioinformatics?

Bioinformatics combines biology, computer science, and statistics. Modern sequencing machines generate gigabytes of data per experiment. Bioinformatics tools process that data to answer biological questions — which genes are active? What mutations are present? Which proteins interact?

💡
Think of it like this: the sequencing machine is a camera, the raw data is the photograph, and bioinformatics is the darkroom where the image becomes meaningful.
2. The central dogma: why it matters

DNA is transcribed into RNA, which is translated into protein. Most bioinformatics workflows target one of these molecules. RNA-seq measures gene expression (DNA → RNA). WGS measures DNA variation. Proteomics measures proteins. Knowing which level you're working at guides your choice of tools.

3. Key concepts you'll encounter

Before starting any analysis, it helps to understand these fundamental concepts:

Reference genome

A standard DNA sequence representing a species. Your experimental reads are mapped back to this to find where they came from.

Read

A short DNA/RNA sequence (50–300 bp) produced by a sequencer. One experiment generates millions of reads.

Alignment

Finding where each read belongs on the reference genome. Like placing puzzle pieces back onto the picture on the box.

Gene expression

How active a gene is. Measured as number of RNA reads that map to that gene's location. More reads = more expression.

Databases

How to download sequences

Biological sequence data is stored in public databases. This guide covers the most important ones and exactly how to download sequences for your organism of interest.

1. NCBI: find any sequence

The National Center for Biotechnology Information (NCBI) hosts the world's largest biological database. Start at ncbi.nlm.nih.gov.

🔗
ncbi.nlm.nih.gov — NCBI homepage. Use the search bar to find sequences, genomes, papers, and raw sequencing data.

Key NCBI databases:

GenBank

Annotated nucleotide sequences. Search by gene name, accession number, or organism. Download as FASTA or GenBank format.

SRA (Sequence Read Archive)

Raw sequencing data from published experiments. Search by GEO dataset ID, paper title, or organism + technique.

RefSeq

Curated, non-redundant reference sequences. Better than raw GenBank for known genes. Transcript accessions start with NM_, NR_, or XM_; protein accessions start with NP_ or XP_.

Genome

Whole genome assemblies. Find reference genomes for any organism, download chromosome FASTA files.

2. Download a gene sequence from NCBI

Step by step — finding and downloading a specific gene:

💡
Example: downloading the TP53 gene sequence from human

1. Go to ncbi.nlm.nih.gov/gene

2. Search: TP53[gene] AND "Homo sapiens"[orgn]

3. Click the top result → scroll to "RefSeq Status" section → click the NM_ accession

4. On the sequence page: click Send to → File → FASTA → Create file

For command-line download:

# Install Entrez Direct tools (once)
sh -c "$(curl -fsSL https://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/install-edirect.sh)"

# Download a gene by accession number
efetch -db nuccore -id NM_000546 -format fasta > TP53_mRNA.fa

# Download protein sequence
efetch -db protein -id NP_000537 -format fasta > TP53_protein.fa
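Once a download finishes, it is worth checking that the file really contains sequence rather than an error page. The helper below is a minimal sketch, not part of Entrez Direct: `fasta_stats` is a hypothetical name, and it assumes an uncompressed FASTA file with `>` header lines.

```shell
# Summarise a FASTA file: number of records and total residues.
# Hypothetical helper; assumes an uncompressed FASTA file.
fasta_stats() {
    awk '
        /^>/ { records++; next }                          # header line starts a new record
        { gsub(/[ \t\r]/, ""); residues += length($0) }   # sequence may wrap over several lines
        END  { printf "%d records, %d residues\n", records, residues }
    ' "$1"
}

# Example, using the file name from the download step:
# fasta_stats TP53_mRNA.fa
```

A result of 0 records usually means the accession was not found or the download was interrupted.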
3. Download genome FASTA from Ensembl
🌐
Ensembl is the best source for complete annotated genomes — especially for model organisms.
# Human genome (GRCh38)
curl -L -O ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

# Mouse genome (GRCm39)
curl -L -O ftp://ftp.ensembl.org/pub/release-109/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz

# Arabidopsis (TAIR10), from Ensembl Plants
curl -L -O ftp://ftp.ensemblgenomes.ebi.ac.uk/pub/plants/release-56/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

# Decompress
gunzip *.fa.gz
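Genome files are large, and an interrupted transfer can leave a truncated archive behind. One way to catch that before decompressing is to test each archive with `gzip -t`. This is a sketch: `check_gz` is a hypothetical helper name, and it assumes `gzip` is on your PATH.

```shell
# Test each .gz file and report any that are corrupt or truncated.
# Hypothetical helper; returns non-zero if at least one file fails.
check_gz() {
    local f rc=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK       $f"
        else
            echo "CORRUPT  $f"
            rc=1
        fi
    done
    return "$rc"
}

# Decompress only if every archive passes:
# check_gz *.fa.gz && gunzip *.fa.gz
```

If a file is reported as corrupt, download it again rather than trying to repair it.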
4. Download RNA-seq data (SRA)
⚠️
SRA data requires the SRA toolkit (fasterq-dump). Install it via conda: conda install -c bioconda sra-tools
# Find accession on ncbi.nlm.nih.gov/sra
# Search: your paper/experiment → click Run → copy SRR number

# Download a single sample (single-end)
fasterq-dump SRR1234567 --outdir ./raw_data --progress

# Download paired-end data
fasterq-dump SRR1234567 --outdir ./raw_data --progress --split-files

# Check what you downloaded
ls -lh ./raw_data/
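For paired-end data, both mate files should hold the same number of reads. The sketch below counts reads in uncompressed FASTQ files (4 lines per read) and compares the two mates. `count_reads` and `check_pairs` are hypothetical helper names, not SRA toolkit commands.

```shell
# Count reads in an uncompressed FASTQ file (4 lines per read).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Compare the read counts of the two mate files of a paired-end run.
check_pairs() {
    local r1 r2
    r1=$(count_reads "$1")
    r2=$(count_reads "$2")
    if [ "$r1" -eq "$r2" ]; then
        echo "OK: $r1 read pairs"
    else
        echo "MISMATCH: $r1 vs $r2 reads"
        return 1
    fi
}

# Example:
# check_pairs ./raw_data/SRR1234567_1.fastq ./raw_data/SRR1234567_2.fastq
```

A mismatch usually points to an incomplete download, so the safest fix is to fetch the run again.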
Databases

NCBI SRA — Download RNA-seq Data

The Sequence Read Archive contains raw sequencing data from hundreds of thousands of published experiments. This tutorial shows you exactly how to find and download data for your analysis.

1. Find your experiment on SRA

Three ways to find SRA data for your project:

🔍
From a paper: Look in the "Data availability" or "Methods" section. Authors usually list accession numbers there (GEO: GSE123456 or SRA: SRP123456).
🔍
Search by organism + technique: Go to ncbi.nlm.nih.gov/sra and search e.g. "Arabidopsis thaliana RNA-seq drought stress"
🔍
From GEO: ncbi.nlm.nih.gov/geo → find dataset → click "SRA Run Selector" to get individual run accessions
2. Get the SRR accession numbers

Each sequencing run has a unique accession number. Formats: SRR (NCBI), ERR (European ENA), DRR (Japanese DDBJ)

From the SRA Run Selector page: tick the samples you need → click "Accession List" to download a file with all accession numbers at once.

# If you have a file of accession numbers (SRR_Acc_List.txt):
while read -r acc; do
    fasterq-dump "$acc" --outdir ./raw_data --progress
done < SRR_Acc_List.txt
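Accession lists that have been opened in a spreadsheet or saved on Windows often pick up blank lines, stray spaces, or Windows line endings, and any of these can make a download loop fail partway through. The sketch below filters a list down to well-formed run accessions (SRR, ERR, or DRR followed by digits). `clean_accessions` is a hypothetical helper name.

```shell
# Print only well-formed run accessions from a list, one per line.
# Strips Windows line endings and surrounding spaces, and skips everything else.
clean_accessions() {
    tr -d '\r' < "$1" |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        grep -E '^[SED]RR[0-9]+$'
}

# Example: feed the cleaned list into the download loop.
# clean_accessions SRR_Acc_List.txt | while read -r acc; do
#     fasterq-dump "$acc" --outdir ./raw_data --progress
# done
```

Lines that are dropped (for example a GSE or SRP identifier) are study-level accessions, not runs, and need to be resolved to run accessions first.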
3. Check if data is single-end or paired-end

This matters for your analysis — wrong setting = garbage results.

# Check layout: look at the SRA run page under "Layout"
# Or dump only the first spot with fastq-dump (-X 1) and see how many files appear:
fastq-dump --skip-technical --readids --read-filter pass \
    --dumpbase --split-3 --clip -X 1 SRR1234567

# After fasterq-dump:
# Single-end: creates SRR1234567.fastq (one file)
# Paired-end: creates SRR1234567_1.fastq and SRR1234567_2.fastq (two files)
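The file-naming rule above can be turned into a small check that reports the layout of a run you have already downloaded. This is a sketch that relies only on the file names described above; `detect_layout` is a hypothetical helper name.

```shell
# Report the layout of a downloaded run from the files present in a folder.
# Arguments: download folder, run accession.
detect_layout() {
    local dir=$1 acc=$2
    if [ -f "$dir/${acc}_1.fastq" ] && [ -f "$dir/${acc}_2.fastq" ]; then
        echo "paired"
    elif [ -f "$dir/$acc.fastq" ]; then
        echo "single"
    else
        echo "unknown"
        return 1
    fi
}

# Example:
# detect_layout ./raw_data SRR1234567
```

"unknown" means no matching FASTQ files were found, for instance because the files are still compressed or sit in a different folder.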
Databases

NCBI — What is it?

The National Center for Biotechnology Information is the world's largest biological database. Everything lives here — genes, genomes, proteins, papers, and raw sequencing data.

🗄️
GenBank

Annotated nucleotide sequences — individual genes, mRNA, genomes. The original biological sequence database.

ncbi.nlm.nih.gov/genbank
📦
SRA (Sequence Read Archive)

Raw sequencing reads from published studies. Millions of RNA-seq, WGS, ChIP-seq datasets available free.

ncbi.nlm.nih.gov/sra
🧬
Gene

Information about known genes — function, location, synonyms, reference sequences, pathways.

ncbi.nlm.nih.gov/gene
🌐
GEO (Gene Expression Omnibus)

Processed gene expression datasets from microarray and RNA-seq experiments. Great for finding existing datasets.

ncbi.nlm.nih.gov/geo
🔬
Protein

Protein sequences — linked to GenBank, RefSeq, UniProt. Search by name, accession, or BLAST similarity.

ncbi.nlm.nih.gov/protein
🏗️
Genome

Complete genome assemblies for thousands of species. Download FASTA, GFF, GTF annotation files.

ncbi.nlm.nih.gov/genome
Reference

Bioinformatics File Formats

Each step of an analysis produces a different file format. Understanding them prevents confusion about what tools expect as input and what they produce as output.

.fastq / .fq
Raw sequencing reads

4 lines per read: ID, sequence, separator, quality scores. Direct output from sequencing machines. Input for alignment.

.fasta / .fa
Reference sequences

A header line starting with >, followed by the nucleotide or amino acid sequence on one or more lines. Used for genomes and gene sequences.

.sam / .bam / .cram
Alignment files

SAM = text alignment. BAM = compressed binary SAM (typically several times smaller). CRAM = reference-based compression, usually smaller still. Output from HISAT2, BWA, STAR.

.gtf / .gff / .gff3
Gene annotation

Defines where genes are on the genome: start, end, exon positions. Required by featureCounts and StringTie; DESeq2 then works on the resulting count table.

.vcf / .bcf
Variant calls

Lists SNPs, INDELs, and structural variants. Output from GATK, FreeBayes. BCF is the binary (compressed) version.

.bed
Genomic intervals

Tab-separated: chromosome, start, end. Start is 0-based and end is exclusive. Used for peaks (ChIP-seq), regulatory regions, any genomic coordinate list.
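Two of these formats are simple enough to sanity-check with a few lines of awk. The sketch below checks the 4-line FASTQ record structure and sums the interval lengths in a BED file. `validate_fastq` and `bed_total_length` are hypothetical helper names; the BED helper assumes non-overlapping intervals and the usual convention of 0-based starts with exclusive ends.

```shell
# Check FASTQ structure: "@" header, sequence, "+" separator,
# and a quality string as long as the sequence.
validate_fastq() {
    awk '
        NR % 4 == 1 && !/^@/  { printf "line %d: header does not start with @\n", NR; bad = 1; exit }
        NR % 4 == 2           { seqlen = length($0) }
        NR % 4 == 3 && !/^\+/ { printf "line %d: separator does not start with +\n", NR; bad = 1; exit }
        NR % 4 == 0 && length($0) != seqlen {
            printf "line %d: quality length differs from sequence length\n", NR; bad = 1; exit
        }
        END {
            if (!bad && NR % 4 != 0) { print "file ends in the middle of a record"; bad = 1 }
            if (!bad) print "OK: " (NR / 4) " reads"
            exit bad + 0
        }
    ' "$1"
}

# Total bases covered by a BED file (each interval length is end - start).
bed_total_length() {
    awk -F '\t' '!/^(#|track|browser)/ && NF >= 3 { total += $3 - $2 } END { print total + 0 }' "$1"
}

# Examples:
# validate_fastq ./raw_data/SRR1234567.fastq
# bed_total_length peaks.bed
```

Both helpers read plain text, so compressed files need to be decompressed first.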

Reference

Key Bioinformatics Databases

A curated list of the most important databases every bioinformatician should know — with links and brief descriptions of what each one is used for.

Sequence & Genome
🇺🇸
NCBI

GenBank, SRA, Gene, GEO, RefSeq, BLAST. The most comprehensive source for biological sequences.

ncbi.nlm.nih.gov
🇪🇺
Ensembl

Annotated genomes for vertebrates and other organisms. Best for reference genome + GTF downloads.

ensembl.org
🌱
Ensembl Plants

Ensembl for plant genomes — Arabidopsis, rice, maize, wheat, soybean, tomato and dozens more.

plants.ensembl.org
🧬
UniProt

Protein sequences and functional annotation. Swiss-Prot (curated) and TrEMBL (automated) sections.

uniprot.org
Pathways & Function
🗺️
KEGG

Metabolic and signalling pathways for a wide range of organisms. Use for pathway enrichment after DESeq2.

kegg.jp
🔵
Gene Ontology (GO)

Standardised vocabulary for gene function. GO enrichment analysis reveals biological processes in DE gene lists.

geneontology.org
🔗
STRING

Protein interaction networks. Upload your DE gene list to see which proteins work together.

string-db.org
🌿
PlantTFDB

Plant transcription factor database. All TF families for all sequenced plant genomes.

planttfdb.gao-lab.org
Getting started

Setting Up Your Environment

Before running any bioinformatics analysis you need a few core tools installed. This guide covers macOS, Linux, and Windows (WSL2) setup from scratch.

🚀
The easiest way to get started is to use RNAflow — it has a built-in system checker that detects old software and provides fix commands automatically.
1. Install Conda (Miniforge)
# macOS / Linux: install Miniforge
brew install miniforge   # macOS with Homebrew
conda init zsh           # initialise for zsh shell

# Close and reopen terminal, then:
conda create -n bioinfo python=3.10 -y
conda activate bioinfo
2. Install core bioinformatics tools
conda install -c conda-forge -c bioconda \
    fastqc fastp hisat2 samtools subread sra-tools -y
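After installing, you can confirm that the tools are visible from your terminal. The sketch below prints every command that is not found on your PATH. `missing_tools` is a hypothetical helper name and is not RNAflow's built-in system checker.

```shell
# Print every command from the argument list that is not on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    local tool rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            rc=1
        fi
    done
    return "$rc"
}

# Check by executable name (featureCounts ships in the subread package,
# fasterq-dump in sra-tools):
# missing_tools fastqc fastp hisat2 samtools featureCounts fasterq-dump
```

If everything is reported missing, the most likely cause is that the conda environment has not been activated in the current terminal.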
Getting started

Using the Terminal

The terminal is a text-based interface to your computer. Most bioinformatics tools run in the terminal. This guide covers the essential commands you'll use every day.

1. Essential commands
pwd               # Print working directory: where am I?
ls                # List files in current folder
ls -lh            # List with file sizes
cd foldername     # Change directory (go into a folder)
cd ..             # Go up one folder
mkdir myfolder    # Create a new folder
cp file1 file2    # Copy a file
mv file1 file2    # Move or rename a file
rm filename       # Delete a file (careful! no recycle bin)
head -10 file     # Show first 10 lines of a file
wc -l file        # Count lines in a file
Databases

Ensembl — Genome Downloads

Ensembl provides complete, well-annotated genome assemblies for hundreds of species. It is the recommended source for reference genomes and GTF annotation files used in RNA-seq alignment.

1. Finding the right FTP URL

Navigate to ensembl.org → your organism → scroll to "Download" → click "Download DNA sequence (FASTA)" → right-click the file ending in .dna.toplevel.fa.gz or .dna.primary_assembly.fa.gz → copy link address.

💡
Use primary_assembly for human and mouse (excludes patches). Use toplevel for smaller genomes (includes everything).
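Ensembl file names follow a regular pattern, so once you know the release number, species, and assembly name you can often build the link yourself. The sketch below assembles such a URL for the main Ensembl FTP site. `ensembl_fasta_url` is a hypothetical helper name, the pattern is an assumption based on typical Ensembl release paths, and Ensembl Plants uses a different host, so always confirm that the file exists before relying on the result.

```shell
# Build an Ensembl FTP URL for a genome FASTA file.
# Arguments: release number, species (lowercase, underscores),
#            assembly name, and type (primary_assembly or toplevel).
ensembl_fasta_url() {
    local release=$1 species=$2 assembly=$3 kind=$4
    local first rest
    # File names capitalise the first letter of the species name.
    first=$(printf '%s' "$species" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$species" | cut -c2-)
    printf 'ftp://ftp.ensembl.org/pub/release-%s/fasta/%s/dna/%s%s.%s.dna.%s.fa.gz\n' \
        "$release" "$species" "$first" "$rest" "$assembly" "$kind"
}

# Example:
# curl -L -O "$(ensembl_fasta_url 109 homo_sapiens GRCh38 primary_assembly)"
```

Not every species has a primary_assembly file; if the download fails, try toplevel or copy the link from the website as described above.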
RNA-seq

What is RNA-seq?

RNA sequencing measures gene expression — which genes are active in a cell at a given moment and by how much. This guide explains the concept, the biology, and the data before you start any analysis.

1. The biological question RNA-seq answers

Every cell contains the same DNA but different cells express different genes — a liver cell and a neuron are so different because they use different subsets of their genes. RNA-seq captures a snapshot of which genes are being read (transcribed to mRNA) at a specific moment. Compare two conditions (treated vs untreated, mutant vs wildtype) and you discover which genes changed — the molecular basis of the difference.

2. The full pipeline in one diagram
🧬
Cells → extract RNA → make cDNA library → sequence millions of fragments → map back to genome → count per gene → statistical test → biological insight

The BioInfoCodex RNAflow app guides you through every step of this pipeline with copy-ready commands and live execution.

RNA-seq

Quality Control with FastQC

Before any analysis, always check the quality of your raw sequencing data. FastQC produces a visual HTML report for each sample in about 2 minutes. This tutorial explains what each module means and what to do when something fails.

🚀
RNAflow runs FastQC automatically and explains every result in context.
RNA-seq

Alignment Concepts

Alignment (or mapping) is the process of finding where each sequencing read originated on the reference genome. Understanding the basics helps you interpret alignment statistics and troubleshoot low mapping rates.

📖
HISAT2 is the aligner used in RNAflow. It is splice-aware, meaning it correctly handles reads that span exon-exon junctions — critical for RNA-seq data.
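To make "mapping rate" concrete, the sketch below computes it from a text SAM file by reading the FLAG column: bit 4 marks an unmapped read, and secondary (256) and supplementary (2048) records are skipped so each read is counted once. `mapping_rate` is a hypothetical helper name. For real projects, `samtools flagstat` reports the same kind of summary directly from BAM files.

```shell
# Mapping rate from a text SAM file, based on the FLAG column (field 2).
mapping_rate() {
    awk -F '\t' '
        /^@/ { next }                                  # skip header lines
        {
            flag = $2
            if (int(flag / 256) % 2 == 1)  next        # secondary alignment
            if (int(flag / 2048) % 2 == 1) next        # supplementary alignment
            total++
            if (int(flag / 4) % 2 == 0) mapped++       # bit 4 unset: the read mapped
        }
        END {
            if (total == 0) { print "no reads"; exit 1 }
            printf "%d/%d mapped (%.1f%%)\n", mapped, total, 100 * mapped / total
        }
    ' "$1"
}

# Example:
# mapping_rate sample.sam
```

For paired-end data each mate is its own record, so the figure is a per-read rate rather than a per-pair rate.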
RNA-seq

DESeq2 — The Statistics

DESeq2 uses a negative binomial statistical model to identify genes that are differentially expressed between conditions, accounting for biological variation between replicates.

⚠️
DESeq2 requires a minimum of 2 biological replicates per condition. 3+ is strongly recommended. Never compare single samples.
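You can check the replicate rule before running any statistics. The sketch below reads a sample sheet and lists every condition with fewer than 2 samples. The sheet layout is an assumption for illustration (tab-separated, one header row, sample name in column 1 and condition in column 2), and `check_replicates` is a hypothetical helper name.

```shell
# List conditions with fewer than 2 replicates in a tab-separated sample sheet.
# Assumed layout: header row, then one line per sample: sample<TAB>condition
check_replicates() {
    awk -F '\t' '
        NR == 1 { next }                 # skip the header row
        NF >= 2 { n[$2]++ }              # count samples per condition
        END {
            for (c in n) if (n[c] < 2) {
                printf "%s: only %d replicate(s)\n", c, n[c]
                bad = 1
            }
            exit bad + 0
        }
    ' "$1"
}

# Example:
# check_replicates samples.tsv && echo "Replicates look fine"
```

No output and a zero exit status mean every condition has at least 2 samples.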
Reference

Bioinformatics Glossary

Quick definitions for the most common terms you'll encounter.

FASTQ

File format storing raw sequencing reads with quality scores. Each read = 4 lines.

Phred score

Quality score per base. Q30 = 99.9% accuracy. Q20 = 99% accuracy.

Adapter

Synthetic DNA sequence added during library prep. Must be removed before alignment.

BAM file

Binary Alignment Map. Compressed version of SAM. Contains where each read mapped on the genome.

GTF / GFF

Gene annotation file. Tells featureCounts where genes start and end on each chromosome.

Normalisation

Adjusting counts to account for sequencing depth differences between samples. Examples: TPM, RPKM, DESeq2 size factors (with VST as a follow-up transformation for plots).

Log2 fold change

How much a gene changes. LFC=1 means 2× higher. LFC=-1 means 2× lower. LFC=0 = no change.

Adjusted p-value (padj)

P-value corrected for multiple testing (Benjamini-Hochberg). Always use padj, never raw p-value.
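Two of the glossary entries are simple arithmetic, and working through them once makes the numbers easier to read. A Phred score Q corresponds to an error probability of 10^(-Q/10), and a log2 fold change is log2(treated / control). The sketch below uses hypothetical helper names and assumes the common Phred+33 quality encoding.

```shell
# Error probability for a Phred score: P = 10^(-Q/10).
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Phred score of a single FASTQ quality character (Phred+33 encoding assumed).
qual_char_to_phred() {
    echo $(( $(printf '%d' "'$1") - 33 ))
}

# log2 fold change between two normalised expression values.
# Arguments: treated value, control value (both must be greater than 0).
log2fc() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", log(a / b) / log(2) }'
}

# Examples:
# phred_to_error 30        # 0.001, i.e. 99.9% accuracy
# qual_char_to_phred I     # 40
# log2fc 200 100           # 1.00, i.e. 2x higher
```

Real pipelines add a small pseudocount or use shrinkage before taking logs, because a value of 0 in either condition makes the ratio undefined.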

Contact

Get in touch

Questions about a tool, a tutorial idea, a bug report, or want to collaborate? We'd love to hear from you.

Contact information

✉️
🐦
Twitter / X
📍
Location
Open-source project — contributors worldwide

Follow BioInfoCodex

Frequently asked questions

Quick answers before you send a message.

Is BioInfoCodex really free forever?
Yes. BioInfoCodex is open-source software released under the MIT licence. All tools and tutorials are and will remain free. There is no premium tier, no subscription, and no paywalled features.
Does my data get uploaded anywhere?
No. All computation happens on your own computer. The HTML app and Python server communicate only via localhost (127.0.0.1), so your data never leaves your machine. There is no remote server, no cloud, no analytics on your files.
Can I contribute a tool or tutorial?
Absolutely. We welcome contributions from researchers and developers. Open a pull request on GitHub, or send us an email describing what you'd like to add. All contributors are credited.
I found a bug — how do I report it?
The best way is to open a GitHub issue with a description of what you did, what you expected, and what happened (include any error messages). You can also use the contact form and we'll log it.
Can I cite BioInfoCodex in my paper?
We are preparing a manuscript for submission. In the meantime, you can cite the GitHub repository URL and version number. We'll post the citation format as soon as the preprint is on bioRxiv.

Send a message

Bug reports, feature requests, collaboration proposals, or just to say hello — all welcome.
