SNP Calling Workflow

by Cosmika Goswami and Umer Zeeshan Ijaz

This is a workflow to detect SNPs from whole genome sequencing data. Here we have Clostridium Difficile strain 078 genomic samples, sequenced through Illumina MiSeq to obtain 300bp long pair-end reads. The basic steps of the workflow is as follows:

1. Arrange Paired-end Reads

Step 1(a): We have 3 samples of Cdiff078 with 61 contigs and paired-end reads.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ ls data/
Step 1(b): Arrange the sequence files according to pairs for each sample, creating 3 folders with a 'Raw' subfolder within each of them.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(awk -F"L001" '{gsub("_$","",$1);print $1}' <(ls data/*.fastq.gz) | sort | uniq); do mkdir $(basename $i); mkdir $(basename $i)/Raw; cp $i*.fastq.gz $(basename $i)/Raw/. ; done
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ ls -d sample*
sample1  sample2  sample3

2. Quality Trimming

We will do the quality trimming of the raw Illumina reads using 'sickle'. Sickle is a tool that uses sliding windows along with quality and length thresholds to determine when quality is sufficiently low to trim the 3'-end of reads and also determines when the quality is sufficiently high enough to trim the 5'-end of reads. It gives 3 files - forward reads, backward reads and singles. With filter of minimum phred-quality score 20 and minimum read size as 10, create forward and backward pair-end FASTQ files.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; ls Raw; sickle pe -f Raw/*_R1_001.fastq.gz -r Raw/*_R2_001.fastq.gz -t sanger -o trimmed.$(basename ${i})_R1.fastq -p trimmed.$(basename ${i})_R2.fastq -s singles.$(basename ${i}).file.fastq -q 20 -l 10 ; cd ..; done
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ ls -d sample*/*.fastq
sample1/singles.sample1.file.fastq  sample1/trimmed.sample1_R1.fastq  sample1/trimmed.sample1_R2.fastq
sample2/singles.sample2.file.fastq  sample2/trimmed.sample2_R1.fastq  sample2/trimmed.sample2_R2.fastq
sample3/singles.sample3.file.fastq  sample3/trimmed.sample3_R1.fastq  sample3/trimmed.sample3_R2.fastq

3. Sequence Alignment and Quality Check

Step 3(a): First we will Index the reference FASTA file of Clostridium Difficile 078, using BWA.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ bwa index Cdiff078.fa 
Step 3(b): Align each of the sample pair-end reads to reference datbase with 'bwa mem' and sorted the output aligned file using 'samtools sort'. The final output files are in bam format. The -M flag in bwa mem makes the bam file picard compatibility.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; bwa mem -M ../Cdiff078.fa trimmed.$(basename ${i})_R1.fastq trimmed.$(basename ${i})_R2.fastq | samtools view -hbS - | samtools sort -m 1000000000  -  $(basename ${i});cd ..;done
Step 3(c): Now, we will assess the quality of mapped reads by generating charts for GC Bias, Mean Quality by Cycle, and Quality Score Distribution
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which CollectGcBiasMetrics.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_GCBias.txt CHART=$(basename ${i})_GCBias.pdf ASSUME_SORTED=true; cd ..;done
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which MeanQualityByCycle.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_Qcycle.txt CHART=$(basename ${i})_Qcycle.pdf; cd ..;done
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; java -Xmx2g -jar $(which QualityScoreDistribution.jar) R=../Cdiff078.fa I=$(basename ${i}).bam O=$(basename ${i})_Qdist.txt CHART=$(basename ${i})_Qdist.pdf; cd ..;done

4. Mark and Remove Duplicates

To identify and remove duplicate reads, we use Picard Tools. First we will identify (Mark) the duplicate reads, remove them and re-align the bam file.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i; cd $i; ls $(basename ${i}).bam ;java -jar $(which MarkDuplicates.jar) INPUT= $(basename ${i}).bam OUTPUT= $(basename ${i}).rmdup.bam METRICS_FILE= duplicateMatrix REMOVE_DUPLICATES=true;java -jar $(which AddOrReplaceReadGroups.jar) I=$(basename ${i}).rmdup.bam O=$(basename $i).rmdup.addgp.bam LB=whatever PL=illumina PU=whatever SM=whatever;samtools index $(basename ${i}).rmdup.addgp.bam; cd ..;done 

5. Indel Realignment

Indel Realignment is performed to realign the reads locally to correct misalignments due to the presence of indels which can be mistaken as SNP. We are using Picard Tools for this. First we will generate a dictionary file for the database, identify the regions to be realigned and then fix the bam file.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ samtools faidx Cdiff078.fa
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ if [ ! -f  "Cdiff078.dict" ]; then java -jar $(which CreateSequenceDictionary.jar) R=Cdiff078.fa O=Cdiff078.dict; fi
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls -d sample*/); do echo $i;cd $i; java -Xmx2g -jar $(which GenomeAnalysisTK.jar) -T RealignerTargetCreator -R ../Cdiff078.fa -I $(basename ${i}).rmdup.addgp.bam -o $(basename ${i}).rmdup.addgp.intervals;java -Xmx4g -jar $(which GenomeAnalysisTK.jar) -I $(basename ${i}).rmdup.addgp.bam -R ../Cdiff078.fa -T IndelRealigner -targetIntervals $(basename ${i}).rmdup.addgp.intervals  -o $(basename ${i}).realigned.bam;cd ..;done

6. SNPs and Indels Detection

To detect variants we will align all 3 sample bam files with 'samtools mpileup' and will use VarScan software. Samtools acquires all the information from the bam files and regroups the reads by calculating genotype likelihood. The VarScan is a tool for variant detection in next-generation sequence data using parallel sequencing. It takes an aligned file, scores and sorts the alignments on a per-read basis, discarding reads that aligned with low identity or to multiple (ambiguous) locations in the reference sequence and gives the SNPs and Indels.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ samtools mpileup -B -f Cdiff078.fa  sample1/sample1.realigned.bam sample2/sample2.realigned.bam sample3/sample3.realigned.bam | java -jar VarScan.v2.3.7.jar mpileup2cns --min-coverage 2 --min-var-freq 0.8 --p-value 0.005 --variants --output-vcf > variants.vcf
[mpileup] 1 samples in 3 input files
mpileup Set max per-file depth to 8000
Only variants will be reported
Min coverage:	2
Min reads2:	2
Min var freq:	0.8
Min avg qual:	15
P-value thresh:	0.005
Reading input from STDIN
3835530 bases in pileup file
5018 variant positions (4607 SNP, 411 indel)
88 were failed by the strand-filter
4930 variant positions reported (4523 SNP, 407 indel)
The variants file is in Variant Calling Format or vcf format.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ head -30 variants.vcf 
##INFO=ID=ADP,Number=1,Type=Integer,Description="Average per-sample depth of bases with Phred score >= 15">
##INFO=ID=WT,Number=1,Type=Integer,Description="Number of samples called reference (wild-type)">
##INFO=ID=HET,Number=1,Type=Integer,Description="Number of samples called heterozygous-variant">
##INFO=ID=HOM,Number=1,Type=Integer,Description="Number of samples called homozygous-variant">
##INFO=ID=NC,Number=1,Type=Integer,Description="Number of samples not called">
##FILTER=ID=str10,Description="Less than 10% or more than 90% of variant supporting reads on one strand">
##FILTER=ID=indelError,Description="Likely artifact due to indel reads at this position">
##FORMAT=ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=ID=SDP,Number=1,Type=Integer,Description="Raw Read Depth as reported by SAMtools">
##FORMAT=ID=DP,Number=1,Type=Integer,Description="Quality Read Depth of bases with Phred score >= 15">
##FORMAT=ID=RD,Number=1,Type=Integer,Description="Depth of reference-supporting bases (reads1)">
##FORMAT=ID=AD,Number=1,Type=Integer,Description="Depth of variant-supporting bases (reads2)">
##FORMAT=ID=FREQ,Number=1,Type=String,Description="Variant allele frequency">
##FORMAT=ID=PVAL,Number=1,Type=String,Description="P-value from Fisher's Exact Test">
##FORMAT=ID=RBQ,Number=1,Type=Integer,Description="Average quality of reference-supporting bases (qual1)">
##FORMAT=ID=ABQ,Number=1,Type=Integer,Description="Average quality of variant-supporting bases (qual2)">
##FORMAT=ID=RDF,Number=1,Type=Integer,Description="Depth of reference-supporting bases on forward strand (reads1plus)">
##FORMAT=ID=RDR,Number=1,Type=Integer,Description="Depth of reference-supporting bases on reverse strand (reads1minus)">
##FORMAT=ID=ADF,Number=1,Type=Integer,Description="Depth of variant-supporting bases on forward strand (reads2plus)">
##FORMAT=ID=ADR,Number=1,Type=Integer,Description="Depth of variant-supporting bases on reverse strand (reads2minus)">
#CHROM		POS	ID	REF	ALT	QUAL	FILTER		INFO				FORMAT							Sample1							Sample2							Sample3
Cdiff078_C01	7967	.	AT	A	.	PASS	ADP=50;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:255:48:47:1:47:97.92%:7.6145E-27:34:35:0:1:22:25	1/1:255:51:51:1:50:98.04%:1.3013E-28:34:37:0:1:19:31	1/1:255:53:53:0:53:100%:1.5943E-31:0:35:0:0:24:29
Cdiff078_C01	7982	.	GT	G	.	PASS	ADP=52;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:243:45:45:1:44:97.78%:4.4304E-25:37:35:1:0:19:25	1/1:255:54:54:1:53:98.15%:2.2126E-30:34:36:0:1:21:32	1/1:255:58:58:1:57:98.28%:9.6072E-33:26:34:1:0:26:31
Cdiff078_C02	312	.	G	A	.	PASS	ADP=39;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:146:26:26:0:26:100%:2.0165E-15:0:35:0:0:14:12	1/1:236:41:41:0:41:100%:2.3541E-24:0:33:0:0:21:20	1/1:255:50:50:0:50:100%:9.9117E-30:0:35:0:0:27:23
Cdiff078_C02	6822	.	GA	G	.	PASS	ADP=63;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:255:65:65:2:63:96.92%:2.3257E-35:35:37:1:1:26:37	1/1:255:57:57:1:56:98.25%:3.7452E-32:20:36:1:0:28:28	1/1:255:68:68:0:68:100%:1.6809E-40:0:36:0:0:32:36
Cdiff078_C02	14407	.	G	A	.	PASS	ADP=47;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:194:34:34:0:34:100%:3.5146E-20:0:36:0:0:14:20	1/1:255:49:49:0:49:100%:3.925E-29:0:36:0:0:26:23	1/1:255:60:60:0:60:100%:1.035E-35:0:36:0:0:25:35
Cdiff078_C02	17978	.	C	CT	.	PASS	ADP=48;WT=0;HET=0;HOM=3;NC=0	GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR	1/1:255:48:48:2:46:95.83%:1.9036E-25:33:36:0:2:26:20	1/1:255:49:49:0:49:100%:3.925E-29:0:36:0:0:27:22	1/1:247:48:48:2:46:95.83%:1.9036E-25:34:34:1:1:21:25

8. Filter Recombination

Recent Recombination events can be detected by looking at the tight clustering of SNPs. We will identify exclude those SNPs from phylogenetic analysis as they could mask the true phylogenetic signal. Step 8(a): First we will divide the vcf file according to its 61 contigs
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(grep -v "#" variants.vcf| cut -f1|sort|uniq); do grep -i -e "$i" -e "#CHROM" variants.vcf> $i.vcf;done
Step 8(b): We will count SNP density for each 61 contigs, with a moving window of 200bp long, using the python script
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls Cdiff078_*.vcf); do echo $i; python -vcfs ${i} --win 200 ; done
step 8(c): Collate all the snp density files together
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ for i in $(ls *snp_densitywin200.csv); do awk '{n=split(FILENAME,a,".");print a[1]"," $0}' $i | sed '1d' ;done> contigs_snps_density.csv
Step 8(d): Now we will plot the snp desnity using the following R script:
density<- read.csv("contigs_snps_density.csv", header=FALSE, sep=",")
colnames(density)<- c("contigs","pos","count")
p1<- qplot(x=density$pos,y=density$count, color=density$contigs) +guides(col = guide_legend(title="Contigs",nrow = 18))
p <- p1+ labs(list(title="SNP density Plot", x= "Base Position", y="Density"))
ggsave("snp_density.pdf", plot=p, width =9)

Step 8(e): Filter the variants at a density level 5 to avoid false positives due to recombination.

[cosmika@quince-srv2 ~/snp_calling_tutorial]$ python --use-density --density 5 --window 200 variants.vcf > variants.filtered.vcf
Step 8(f): Number of SNPs after filtering are
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ vcftools --vcf variants.filtered.vcf --remove-indels
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf variants.filtered.vcf

After filtering, kept 3 out of 3 Individuals
After filtering, kept 21 out of a possible 32 Sites
Run Time = 0.00 seconds
Step 8(g): We have metadata for these samples
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ cat metadata.txt
Sample	Gender	Location	Region	Year
ref	ref	ref		ref	ref
Sample1	Male	Highland	North	2009
Sample2	Male	Glasgow 	West	2009
Sample3	Female	Glasgow 	West	2008
Step 8(h): Convert the filtered snps into a psudo-sequence using the python script ''. It gives the 'output-prefix'.all.alleles.fasta file in FASTA format.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ python --metadata metadata.txt --output_prefix snps  variants.filtered.vcf
The FASTA file is
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ cat snps_all_alleles.fasta 

9. Phylogenetic Tree Build using FastTree

Step 9: Use FastTree to get the phylogenetic tree in newick format. It takes sequence alignment file in FASTA format. The Flag -nt stands for nucleotide (not protein) alignment.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ FastTree -nt  snps_all_alleles.fasta > snps.fastree.nwk
Step 10(a): FigTree is a graphical viewer of the phylogenetic tree. It takes a tab seperated metadata file with coloumn header and a tree file in newick format.

Step 10(b): Another graphical viewer of phylogenetic tree is 'PhyloBar'

10. Filter Recombination and Tree Build using Gubbins

Step 11: Gubbins is an algorithm that iteratively identifies recombination regions while simulaniusly constructing Maximum Likelihood phylogeny based on the putative point mutations outside of these recombination regions. It takes an FASTA alignment file and gives an output phylogenetic tree in newick format. First, to construct the FASTA alignment input file for Gubbins, we will filter out the snps using VCFtools from the VARScan output vcf file. VCFtools can perform analyses on the variants that pass through the filters or simply write those variants out to a new file. We can create subsets of VCF files or just removing unwanted variants from VCF files. To write out the variants that pass through filters use the --recode option. In addition, use --recode-INFO-all to include all data from the INFO fields in the output. It will generate a file 'out-prefix'.recode.vcf in the current folder.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ vcftools --vcf variants.vcf --out snps_only.variants --remove-indels --recode --recode-INFO-all
Step 11(b): We then convert the snp vcf file into a FASTA file of SNP psudo-sequences as above.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ python -m  metadata.txt -o snps_only.variants snps_only.variants.recode.vcf
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ cat snps_only.variants_all_alleles.fasta
Step 11(c): Here we create a virtual python 2.7.8 environment, to run Gubbins.
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ bash
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ python -V
Python 2.6.6
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ export PYENV_ROOT="/home/opt/.pyenv"
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ export PATH="$PYENV_ROOT/bin:$PATH"
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ eval "$(pyenv init -)"
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ export PATH=/home/opt/fastml/programs/fastml:$PATH
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ export PATH=/home/opt/standard-RAxML:$PATH
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ python -V
Python 2.7.8
Step 11(c): Now run Gubbins to generate phylogeny tree with FastTree (or RAxML optional).
[cosmika@quince-srv2 ~/snp_calling_tutorial]$ --tree_builder fasttree --prefix gubbins snps_only.variants_all_alleles.fasta
Step 12: Visualize the tree in FigTree.