Binning metagenomic contigs by coverage and composition using CONCOCT

Alternative workflow to generate the coverage table

Sometimes we need to generate a coverage table from sequence files in which the paired-end information has been lost, for example, when working with human-associated microbes after filtering human contaminants out of the samples with DeconSeq. In such cases, the following steps may be useful for generating a coverage table. Please go through the Linux command line exercises for NGS data processing tutorial to understand BWA, SAMtools, and BEDtools better.
To start with, we will organise our data so that there is a folder for every sample, and within each folder there is a Raw folder containing the untrimmed paired-end FASTQ files:

Sample1/Raw/*_R1.fastq
Sample1/Raw/*_R2.fastq
Sample2/Raw/*_R1.fastq
Sample2/Raw/*_R2.fastq
Sample3/Raw/*_R1.fastq
Sample3/Raw/*_R2.fastq
Sample4/Raw/*_R1.fastq
Sample4/Raw/*_R2.fastq
Sample5/Raw/*_R1.fastq
Sample5/Raw/*_R2.fastq

Step 1: We trim the reads using sickle, which generates *.R1_trimmed.fastq, *.R2_trimmed.fastq and *.singlet.fastq files in each sample folder:

$ for i in $(ls -d *); do cd $i; echo processing $i; sickle pe -f Raw/*_R1.fastq -r Raw/*_R2.fastq -o $i.R1_trimmed.fastq -p $i.R2_trimmed.fastq -s $i.singlet.fastq -t "sanger" -q 20 -l 10; cd ..; done

Step 2: We convert the FASTQ files to FASTA files using PRINSEQ. The resulting files will have the name *_prinseq_good_*.fasta

$ for i in $(ls -d *); do cd $i; echo processing $i; prinseq-lite.pl -fastq $i.R1_trimmed.fastq -out_format 1;prinseq-lite.pl -fastq $i.R2_trimmed.fastq -out_format 1; cd ..; done

Step 3: We collate the forward and reverse reads together and generate a collated FASTA file

$ for i in $(ls -d *); do cd $i; echo processing $i; cat *R1_trimmed_prinseq*.fasta *R2_trimmed_prinseq*.fasta > ${i}_collated.fasta; cd ..; done

Step 4: We remove human contaminants using DeconSeq, which generates two files, *clean*.fa and *cont*.fa, in the Raw folder. The collated FASTA file from Step 3 sits one level up, in the sample folder:

$ for i in $(ls -d *); do cd $i; cd Raw; echo processing $i; perl /home/opt/deconseq-standalone-0.4.3/deconseq.pl -dbs hsref -out_dir . -f ../${i}_collated.fasta; cd ..; cd ..; done

Step 5: We assemble the reads. First, we pool the cleaned reads from the Raw folders of all samples:

$ cat */Raw/*clean.fa > /path_to_assembly_folder/pooled_samples.fa

We can then use IDBA-UD (idba_ud) to assemble the pooled reads:

$ cd /path_to_assembly_folder
$ idba_ud -l pooled_samples.fa -o collated_assembly --mink 21 --maxk 121 --num_threads 20 --pre_correction

Say the best assembly is obtained for a k-mer size of 121; we will then have a file contig-121.fa in the collated_assembly folder. We need to keep only those contigs that are longer than 1000 bp, and we write them to filtered_contigs.fa in the same folder. You can get the following script from here.

$ perl ~/bin/contig_size_select.pl -low 1000 -high 10000000 collated_assembly/contig-121.fa > collated_assembly/filtered_contigs.fa
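
To show what the length selection does, here is a minimal stand-in sketch in awk. It is not the contents of contig_size_select.pl: the function name filter_by_length and the toy file are hypothetical, and treating both bounds as inclusive is an assumption.

```shell
# Toy multi-FASTA with contigs of 4, 12 and 8 bases (c2 is wrapped over two lines)
printf '>c1\nACGT\n>c2\nACGTAC\nGTACGT\n>c3\nACGTACGT\n' > toy_contigs.fa

# Hypothetical helper: keep records whose sequence length lies in [low, high]
# usage: filter_by_length <fasta> <low> <high>
filter_by_length() {
  awk -v low="$2" -v high="$3" '
    function emit() {
      if (name != "" && length(seq) >= low && length(seq) <= high)
        print name "\n" seq
    }
    /^>/ { emit(); name = $0; seq = ""; next }   # header: flush the previous record
    { seq = seq $0 }                             # sequence line: append to current record
    END { emit() }                               # flush the last record
  ' "$1"
}

kept=$(filter_by_length toy_contigs.fa 8 100)
echo "$kept"
```

With a lower bound of 8, the 4-base contig c1 is dropped and the wrapped contig c2 is written back on a single line.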

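Step 6: Back in the main folder, we index the filtered contigs (bwa mem needs an index of the reference), map the reads back to the contigs, and generate SAM files:

$ bwa index /path_to_assembly_folder/collated_assembly/filtered_contigs.fa
$ for i in $(ls -d *); do cd $i;echo processing $i;bwa mem /path_to_assembly_folder/collated_assembly/filtered_contigs.fa Raw/*clean*.fa > $i.sam; cd ..; done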

Step 7: We convert the SAM files to BAM files:

$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -h -b -S $i.sam > $i.bam; cd ..; done

Step 8: We extract mapped reads:

$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -b -F 4 $i.bam > $i.mapped.bam; cd ..; done
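
The -F 4 option excludes every alignment whose FLAG has bit 4 (read unmapped) set. As a small illustration of that bitwise test in plain shell arithmetic (the helper name is_mapped is hypothetical and not part of samtools):

```shell
# Hypothetical helper: succeeds when the "read unmapped" bit (4) is NOT set
is_mapped() {
  [ $(( $1 & 4 )) -eq 0 ]
}

# Classify a few example FLAG values
flag_report=""
for flag in 0 4 16 77; do
  if is_mapped "$flag"; then
    flag_report="$flag_report$flag:mapped "
  else
    flag_report="$flag_report$flag:unmapped "
  fi
done
echo "$flag_report"
```

A FLAG of 16 (reverse strand) is kept, while 77 (64 + 8 + 4 + 1) is dropped because bit 4 is set.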

Step 9: We need to generate a lengths.genome file in each sample folder, which bedtools will use to generate the coverage information. It lists each contig name and its length, taken from the @SQ lines of the BAM header:

$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -H $i.mapped.bam | perl -ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome ; cd ..; done

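Step 10: We also need to sort the BAM files. The command below uses the legacy output-prefix syntax of samtools 0.1.x; from samtools 1.3 onwards, the output file is given with -o instead (samtools sort -m 1G -o $i.mapped.sorted.bam $i.mapped.bam):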

$ for i in $(ls -d *); do cd $i;echo processing $i;samtools sort -m 1000000000 $i.mapped.bam $i.mapped.sorted ; cd ..; done

Step 11: Now generate the coverage information for each sample:

$ for i in $(ls -d *); do cd $i;echo processing $i;bedtools genomecov -ibam $i.mapped.sorted.bam -g lengths.genome  > ${i}_coverage.txt; cd ..; done

Step 12: Convert the coverage information to CSV format, with the mean coverage of each contig, which we will then use with my Perl script to generate a single table:

$ for i in $(ls -d *); do cd $i;echo processing $i; awk -F"\t" '{l[$1]=l[$1]+($2 *$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' ${i}_coverage.txt > ${i}_coverage.csv; cd ..; done
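
The awk command in Step 12 works on the default histogram output of bedtools genomecov, whose columns are contig, depth, number of bases at that depth, contig length, and fraction. It sums depth × bases per contig and divides by the contig length, which gives the mean coverage. A small worked example on made-up numbers (the file name toy_coverage.txt is hypothetical):

```shell
# Toy genomecov histogram (tab-separated): contig, depth, bases at depth, contig length, fraction
printf 'contig_1\t0\t200\t1000\t0.2\ncontig_1\t2\t500\t1000\t0.5\ncontig_1\t4\t300\t1000\t0.3\ncontig_2\t1\t2000\t2000\t1\n' > toy_coverage.txt

# Same awk program as in Step 12; sorted only to make the output order stable
mean_cov=$(awk -F"\t" '{l[$1]=l[$1]+($2 *$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' toy_coverage.txt | sort)
echo "$mean_cov"
```

For contig_1 this is (0×200 + 2×500 + 4×300) / 1000 = 2.2, and for contig_2 it is (1×2000) / 2000 = 1.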

Step 13: In the main folder, collate the coverage tables from individual samples together:

$ perl ~/bin/collateResults.pl -f . -p _coverage.csv > coverage_table.csv
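
The collation step merges the per-sample CSV files into one table with a row per contig and a column per sample. The sketch below is a hypothetical stand-in in awk, not the code of collateResults.pl: the function name collate, the toy folders, the missing header row, and filling absent contigs with 0 are all assumptions.

```shell
# Two toy per-sample coverage files; contig_2 and contig_3 each occur in one sample only
mkdir -p toy_collate/S1 toy_collate/S2
printf 'contig_1,2.2\ncontig_2,1\n' > toy_collate/S1/S1_coverage.csv
printf 'contig_1,0.5\ncontig_3,7\n' > toy_collate/S2/S2_coverage.csv

# Hypothetical helper: one output column per input file, in the order given
collate() {
  awk -F, '
    FNR == 1 { n++ }                         # a new input file starts: next sample column
    { val[$1, n] = $2; seen[$1] = 1 }        # remember coverage per (contig, sample)
    END {
      for (c in seen) {
        line = c
        for (s = 1; s <= n; s++)
          line = line "," (((c, s) in val) ? val[c, s] : 0)
        print line
      }
    }
  ' "$@" | sort
}

table=$(collate toy_collate/S1/S1_coverage.csv toy_collate/S2/S2_coverage.csv)
echo "$table"
```

Each row then carries the contig name followed by its mean coverage in S1 and S2, which is the general shape of a coverage table with one column per sample.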

Binning metagenomic contigs by coverage and composition using `CONCOCT`

Reading material

Initial setup and datasets for this tutorial

Assembling metagenomic reads

Cutting up contigs

Map the reads onto the contigs

Generate coverage table

Alternative workflow to generate the coverage table

Generate linkage table

Run `CONCOCT`

Evaluate output

Validation using single-copy core genes

Incorporating linkage information
