Binning metagenomic contigs by coverage and composition using CONCOCT
by Umer Zeeshan Ijaz
Reading material
Please read the paper:
J Alneberg, BS Bjarnason, I de Bruijn, M Schirmer, J Quick, UZ Ijaz, L Lahti, NJ Loman, AF Andersson, and C Quince. Binning metagenomic contigs by coverage and composition. Nature Methods, 2014
and get the software from:
https://github.com/BinPro/CONCOCT
https://github.com/umerijaz/TAXAassign
Initial setup and datasets for this tutorial
On Windows machine (including teaching labs at the University), you will use mobaxterm (google it and download the portable version if you have not done so already) AND NOT putty to log on to the server as it has XServer built into it, allows X11 forwarding, and you can run the softwares that have a GUI. If you are on linux/Mac OS, you can use -Y or -X switch in the ssh command below to allow the same capability.
ssh -Y -p 22567 MScBioinf@quince-srv2.eng.gla.ac.uk
When you have successfully logged on, create a folder CTutorial_YourName
, go inside it and do all your processing there!
[MScBioinf@quince-srv2 ~]$ mkdir CTutorial_Umer
[MScBioinf@quince-srv2 ~]$ cd CTutorial_Umer/
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ ls
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$
The latest version of CONCOCT
requires Python 2.7.x to run. Since the version of python on the server is 2.6.6, we will first create a new bash
shell, load pyenv
and switch to Python 2.7.8.
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ bash
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ python -V
Python 2.6.6
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export PYENV_ROOT="/home/opt/.pyenv"
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export PATH="$PYENV_ROOT/bin:$PATH"
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ eval "$(pyenv init -)"
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ python -V
Python 2.7.8
Next, we will set the environmental variables to do this exercise: CONCOCT
to point to the software, CONCOCT_TEST
to the test data repository, and CONCOCT_EXAMPLE
to the folder where you will do all the processing
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export CONCOCT=/home/opt/CONCOCT
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export CONCOCT_TEST=/home/opt/CONCOCT_TUTORIAL/CONCOCT-test-data-0.3.2
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export CONCOCT_EXAMPLE=~/CTutorial_Umer
Assembling metagenomic reads
The first step in the analysis is to assemble all reads into contigs, here we use the software Velvet
for this. This step can be computationaly intensive but for this small data set comprising a synthetic community of four species and 16 samples (100,000 reads per sample) it can be performed in a few minutes. If you do not wish to execute this step, the resulting contigs are already in the test data repository, and you can copy them from there instead. The commands for running Velvet
are:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cat $CONCOCT_TEST/reads/Sample*_R1.fa > All_R1.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cat $CONCOCT_TEST/reads/Sample*_R2.fa > All_R2.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ velveth velveth_k71 71 -fasta -shortPaired -separate All_R1.fa All_R2.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ velvetg velveth_k71 -ins_length 400 -exp_cov auto -cov_cutoff auto
After the assembly is finished create a directory with the resulting contigs and copy the result of Velvet
there (this output is also in $CONCOCT_TEST/contigs
)
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ mkdir contigs
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cp velveth_k71/contigs.fa contigs/velvet_71.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ rm All_R1.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ rm All_R2.fa
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ ls -1
contigs
velveth_k71
Cutting up contigs
In order to give more weight to larger contigs and mitigate the effect of assembly errors we cut up the contigs into chunks of 10 Kb. The final chunk is appended to the one before it if it is < 10 Kb to prevent generating small contigs. This means that no contig < 20 Kb is cut up. We use the script cut_up_fasta.py
for this:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ python $CONCOCT/scripts/cut_up_fasta.py -c 10000 -o 0 -m contigs/velvet_71.fa > contigs/velvet_71_c10K.fa
Map the reads onto the contigs
After assembly we map the reads of each sample back to the assembly using bowtie2
and remove PCR duplicates with MarkDuplicates
. The coverage histogram for each BAM file is computed with genomeCoverageBed
from bedtools
. The script that calls these programs is provided with CONCOCT
.
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ export MRKDUP=$(which MarkDuplicates.jar)
It is typically located within your picard-tools installation.
The following command is to be executed in the $CONCOCT_EXAMPLE
directory you created in the previous part. First create the index on the assembly for bowtie2
:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ bowtie2-build contigs/velvet_71_c10K.fa contigs/velvet_71_c10K.fa
Then run this for loop, which for each sample creates a folder and runs map-bowtie2-markduplicates.sh
:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ for f in $CONCOCT_TEST/reads/*_R1.fa; do mkdir -p map/$(basename $f); cd map/$(basename $f); bash $CONCOCT/scripts/map-bowtie2-markduplicates.sh -ct 1 -p '-f' $f $(echo $f | sed s/R1/R2/) pair $CONCOCT_EXAMPLE/contigs/velvet_71_c10K.fa asm bowtie2; cd ../..; done
[uzi@quince-srv2 ~/CTutorial_Umer]$ ls -1
contigs
map
velveth_k71
The parameters used for map-bowtie2-markduplicates.sh
are:
-c option to compute coverage histogram with genomeCoverageBed
-t option is number of threads
-p option is the extra parameters given to bowtie2. In this case -f.
The five arguments are:
pair1, the fasta/fastq file with the #1 mates
pair2, the fasta/fastq file with the #2 mates
pair_name, a name for the pair used to prefix output files
assembly, a fasta file of the assembly to map the pairs to
assembly_name, a name for the assembly, used to postfix outputfiles
outputfolder, the output files will end up in this folder
Generate coverage table
Use the BAM files of each sample to create a table with the coverage of each contig per sample.
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE/map
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ python $CONCOCT/scripts/gen_input_table.py --isbedfiles --samplenames <(for s in Sample*; do echo $s | cut -d'_' -f1; done) ../contigs/velvet_71_c10K.fa */bowtie2/asm_pair-smds.coverage > concoct_inputtable.tsv
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ mkdir $CONCOCT_EXAMPLE/concoct-input
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ mv concoct_inputtable.tsv $CONCOCT_EXAMPLE/concoct-input/
Alternative workflow to generate the coverage table
Sometimes we are interested in generating a coverage table from the sequence files where we have lost the paired-end information, for example, in scenarios where we are dealing with human associated microbes after having filtered out our samples for human contaminants using
DeconSeq
. In such a case, the following steps may be useful for generating a coverage table. Please go through the
Linux command line exercises for NGS data processing tutorial to understand
BWA
,
SAMtools
, and
BEDtools
better.
To start with will organise our data in such a manner that there is a folder for every sample and within each folder there is a
Raw
folder containing untrimmed paired-end FASTQ files:
Sample1/Raw/*_R1.fastq
Sample1/Raw/*_R2.fastq
Sample2/Raw/*_R1.fastq
Sample2/Raw/*_R2.fastq
Sample3/Raw/*_R1.fastq
Sample3/Raw/*_R2.fastq
Sample4/Raw/*_R1.fastq
Sample4/Raw/*_R2.fastq
Sample5/Raw/*_R1.fastq
Sample5/Raw/*_R2.fastq
Step 1: We trim the reads using
sickle
and generate
*R1_trimmed.fastq
and
*R2_trimmed.fastq
file in
Raw
folder:
$ for i in $(ls -d *); do cd $i; echo processing $i; sickle pe -f Raw/*_R1.fastq -r Raw/*_R2.fastq -o $i.R1_trimmed.fastq -p $i.R2_trimmed.fastq -s $i.singlet.fastq -t "sanger" -q 20 -l 10; cd ..; done
Step 2: We convert the FASTQ files to FASTA files using
PRINSEQ
. The resulting files will have the name
*_prinseq_good_*.fasta
$ for i in $(ls -d *); do cd $i; echo processing $i; prinseq-lite.pl -fastq $i.R1_trimmed.fastq -out_format 1;prinseq-lite.pl -fastq $i.R2_trimmed.fastq -out_format 1; cd ..; done
Step 3: We collate the forward and reverse reads together and generate a collated FASTA file
$ for i in $(ls -d *); do cd $i; echo processing $i; cat *R1_trimmed_prinseq*.fasta *R2_trimmed_prinseq*.fasta > ${i}_collated.fasta; cd ..; done
Step 4: We remove human contaminants using
DeconSeq
which will generate two files
*clean*.fa
and
*cont*.fa
.
$ for i in $(ls -d *); do cd $i; cd Raw; echo processing $i; perl /home/opt/deconseq-standalone-0.4.3/deconseq.pl -dbs hsref -out_dir . -f *_collated.fasta; cd ..; cd ..; done
Step 5:
Assembling reads. We can then pool all the sample reads together
$ cat */*clean.fa > /path_to_assembly_folder/pooled_samples.fa
We can then use
idba-ud
to assemble the reads
$ cd /path_to_assembly_folder
$ idba_ud -l pooled_samples.fa -o collated_assembly --mink 21 --maxk 121 --num_threads 20 --pre_correction
Say we get the best assembly done for kmer 121, then we will have a file
contig-121.fa
in the
collated_assembly
folder.
We need to extract only those contigs that have length > 1000. You can get the following script from
here.
$ perl ~/bin/contig_size_select.pl -low 1000 -high 10000000 contig-121.fa > filtered_contigs.fa
Step 6: We map the reads back to contigs and generate SAM files:
$ for i in $(ls -d *); do cd $i;echo processing $i;bwa mem /path_to_assembly_folder/collated_assembly/filtered_contigs.fa Raw/*clean*.fa > $i.sam; cd ..; done
Step 7: We convert the SAM files to BAM files:
$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -h -b -S $i.sam > $i.bam; cd ..; done
Step 8: We extract mapped reads:
$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -b -F 4 $i.bam > $i.mapped.bam; cd ..; done
Step 9: We need to generate the
lengths.genome
in each folder which will be used by
bedtools
to generate the coverage information
$ for i in $(ls -d *); do cd $i;echo processing $i;samtools view -H $i.mapped.bam | perl -ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome ; cd ..; done
Step 10: We also need to sort the BAM files:
$ for i in $(ls -d *); do cd $i;echo processing $i;samtools sort -m 1000000000 $i.mapped.bam $i.mapped.sorted ; cd ..; done
Step 11: Now generate the coverage information for each sample:
$ for i in $(ls -d *); do cd $i;echo processing $i;bedtools genomecov -ibam $i.mapped.sorted.bam -g lengths.genome > ${i}_coverage.txt; cd ..; done
Step 12: Generate coverage information in csv format that we will use with my
perl script to generate a single table
$ for i in $(ls -d *); do cd $i;echo processing $i; awk -F"\t" '{l[$1]=l[$1]+($2 *$3);r[$1]=$4} END {for (i in l){print i","(l[i]/r[i])}}' ${i}_coverage.txt > ${i}_coverage.csv; cd ..; done
Step 13: In the main folder, collate the coverage tables from individual samples together:
$ perl ~/bin/collateResults.pl -f . -p _coverage.csv > coverage_table.csv
Generate linkage table
The same BAM files can be used to give linkage per sample between contigs:
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ cd $CONCOCT_EXAMPLE/map
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ python $CONCOCT/scripts/bam_to_linkage.py -m 8 --regionlength 500 --fullsearch --samplenames <(for s in Sample*; do echo $s | cut -d'_' -f1; done) ../contigs/velvet_71_c10K.fa Sample*/bowtie2/asm_pair-smds.bam > concoct_linkage.tsv
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ mv concoct_linkage.tsv $CONCOCT_EXAMPLE/concoct-input/
Run CONCOCT
To see possible parameter settings with a description run
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ $CONCOCT/bin/concoct --help
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
[--composition_file COMPOSITION_FILE] [-c CLUSTERS]
[-k KMER_LENGTH] [-l LENGTH_THRESHOLD] [-r READ_LENGTH]
[--total_percentage_pca TOTAL_PERCENTAGE_PCA] [-b BASENAME]
[-s SEED] [-i ITERATIONS] [-e EPSILON] [--no_cov_normalization]
[--no_total_coverage] [-o] [-d] [-v]
optional arguments:
-h, --help show this help message and exit
--coverage_file COVERAGE_FILE
specify the coverage file, containing a table where
each row correspond to a contig, and each column
correspond to a sample. The values are the average
coverage for this contig in that sample. All values
are separated with tabs.
--composition_file COMPOSITION_FILE
specify the composition file, containing sequences in
fasta format. It is named the composition file since
it is used to calculate the kmer composition (the
genomic signature) of each contig.
-c CLUSTERS, --clusters CLUSTERS
specify maximal number of clusters for VGMM, default
400.
-k KMER_LENGTH, --kmer_length KMER_LENGTH
specify kmer length, default 4.
-l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
specify the sequence length threshold, contigs shorter
than this value will not be included. Defaults to
1000.
-r READ_LENGTH, --read_length READ_LENGTH
specify read length for coverage, default 100
--total_percentage_pca TOTAL_PERCENTAGE_PCA
The percentage of variance explained by the principal
components for the combined data.
-b BASENAME, --basename BASENAME
Specify the basename for files or directory where
outputwill be placed. Path to existing directory or
basenamewith a trailing '/' will be interpreted as a
directory.If not provided, current directory will be
used.
-s SEED, --seed SEED Specify an integer to use as seed for clustering. 0
gives a random seed, 1 is the default seed and any
other positive integer can be used. Other values give
ArgumentTypeError.
-i ITERATIONS, --iterations ITERATIONS
Specify maximum number of iterations for the VBGMM.
Default value is 500
-e EPSILON, --epsilon EPSILON
Specify the epsilon for VBGMM. Default value is 1.0e-6
--no_cov_normalization
By default the coverage is normalized with regards to
samples, then normalized with regards of contigs and
finally log transformed. By setting this flag you skip
the normalization and only do log transorm of the
coverage.
--no_total_coverage By default, the total coverage is added as a new
column in the coverage data matrix, independently of
coverage normalization but previous to log
transformation. Use this tag to escape this behaviour.
-o, --converge_out Write convergence info to files.
-d, --debug Debug parameters.
-v, --version show program's version number and exit
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$
We will only run CONCOCT
for some standard settings here. First we need to parse the input table to just contain the mean coverage for each contig in each sample:
[MScBioinf@quince-srv2 ~/CTutorial_Umer/map]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cut -f1,3-26 concoct-input/concoct_inputtable.tsv > concoct-input/concoct_inputtableR.tsv
Then run CONCOCT
with 40 as the maximum number of cluster -c 40
, that we guess is appropriate for this data set:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ concoct -c 40 --coverage_file concoct-input/concoct_inputtableR.tsv --composition_file contigs/velvet_71_c10K.fa -b concoct-output/
Up and running. Check /home/MScBioinf/CTutorial_Umer/concoct-output/log.txt for progress
CONCOCT Finished, the log shows how it went.
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$
The program generates a number of files in the output directory that can be set with the -b
parameter and will be the present working directory by default.
Evaluate output
First we can visualise the clusters in the first two PCA dimensions:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ mkdir evaluation-output
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ Rscript $CONCOCT/scripts/ClusterPlot.R -c concoct-output/clustering_gt1000.csv -p concoct-output/PCA_transformed_data_gt1000.csv -m concoct-output/pca_means_gt1000.csv -r concoct-output/pca_variances_gt1000_dim -l -o evaluation-output/ClusterPlot.pdf
The resulting ClusterPlot.pdf
is:
We can also compare the clustering to species labels. For this test data set we know these labels, they are given in the file clustering_gt1000_s.csv
. For real data labels may be obtained through taxonomic classification, e.g. using:
https://github.com/umerijaz/TAXAassign
In either case we provide a script Validate.pl
for computing basic metrics on the cluster quality:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cp $CONCOCT_TEST/evaluation-output/clustering_gt1000_s.csv evaluation-output/
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/Validate.pl --cfile=concoct-output/clustering_gt1000.csv --sfile=evaluation-output/clustering_gt1000_s.csv --ofile=evaluation-output/clustering_gt1000_conf.csv --ffile=contigs/velvet_71_c10K.fa
N M TL S K Rec. Prec. NMI Rand AdjRand
684 684 6.8023e+06 5 4 0.898494 0.999604 0.842320 0.912366 0.824804
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$
This script requires the clustering output by CONCOCT
, i.e., concoct-output/clustering_gt1000.csv
. This file is a comma separated file listing each contig id followed by the cluster index and the species labels that have the same format but with a text label rather than a cluster index. This gives the no. of contigs N clustered, the number with labels M, the number of unique labels S, the number of clusters K, the recall, the precision, the normalised mutual information (NMI), the Rand index, and the adjusted Rand index. It also generates a file called a confusion matrix with the frequencies of each species in each cluster. We provide a further script for visualising this as a heatmap:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/ConfPlot.R -c evaluation-output/clustering_gt1000_conf.csv -o evaluation-output/clustering_gt1000_conf.pdf
This generates a file with normalised frequencies of contigs from each cluster across species (clustering_gt1000_conf.pdf
):
To understand how to validate clustering performance using different clustering indices, read the
reference slides on
clust_validity.R
as well as play with the
example datasets.
clust_validity.R
[
Usage] takes a CSV file of N d-dimensional features, performs K-means or dp-means clustering and chooses the optimum number of clusters based on either of the implemented internal clustering validation indices. Additionally, if the CSV file contains a column titled "True_Clusters" containing true clusters membership for each object, you can use it to validate clustering performance.
Validation using single-copy core genes
We can also evaluate the clustering based on single-copy core genes. You first need to find genes on the contigs and functionally annotate these. Here we used prodigal for gene prediction and annotation, but you can use anything you want:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ mkdir -p $CONCOCT_EXAMPLE/annotations/proteins
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ prodigal -a annotations/proteins/velvet_71_c10K.faa -i contigs/velvet_71_c10K.fa -f gff -p meta > annotations/proteins/velvet_71_c10K.gff
We used RPS-Blast
to COG annotate the protein sequences using the script RSBLAST.sh
. You need to set the evironmental variable COGSDB_DIR
:
[uzi@quince-srv2 ~/CTutorial_Umer]$ export COGSDB_DIR=/home/opt/rpsblast_db/
The script furthermore requires GNUparallel
and rpsblast
. Here we run it on eight cores:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/RPSBLAST.sh -f annotations/proteins/velvet_71_c10K.faa -p -c 8 -r 1
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ mkdir $CONCOCT_EXAMPLE/annotations/cog-annotations
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ mv velvet_71_c10K.out annotations/cog-annotations/
The blast output has been placed in:
$CONCOCT_TEST/annotations/cog-annotations/velvet_71_c10K.out
Finally, we filtered for COGs representing a majority of the subject to ensure fragmented genes are not over-counted and generated a table of counts of single-copy core genes in each cluster generated by CONCOCT
. Remember to use a real email adress, this is supplied since information is fetched from ncbi using their service eutils
, and the email is required to let them know who you are.
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/COG_table.py -b annotations/cog-annotations/velvet_71_c10K.out -m $CONCOCT/scgs/scg_cogs_min0.97_max1.03_unique_genera.txt -c concoct-output/clustering_gt1000.csv --cdd_cog_file $CONCOCT/scgs/cdd_to_cog.tsv > evaluation-output/clustering_gt1000_scg.tab
The script requires the clustering output by concoct concoct-output/clustering_gt1000.csv, a file listing a set of SCGs (e.g. a set of COG ids) to use scgs/scg_cogs_min0.97_max1.03_unique_genera.txt
and a mapping of Conserved Domain Database ids to COG ids $CONCOCT/scgs/cdd_to_cog.tsv
. If these protein sequences were generated by PROKKA, the names of the contig ids needed to be recovered from the GFF file. Since prodigal has been used, the contig ids instead are recovered from the protein ids using a separator character, in which case only the string before (the last instance of) the separator will be used as contig id in the annotation file. In the case of prodigal the separator that should be used is _ and this is the default value, but other characters can be given through the -separator
argument.
The output file is a tab-separated file with basic information about the clusters (cluster id, ids of contigs in cluster and number of contigs in cluster) in the first three columns, and counts of the different SCGs in the following columns.
This can also be visualised graphically using the R
script:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ cd $CONCOCT_EXAMPLE
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/COGPlot.R -s evaluation-output/clustering_gt1000_scg.tab -o evaluation-output/clustering_gt1000_scg.pdf
This generates the clustering_gt1000_scg.pdf
file:
Incorporating linkage information
To perform a hierarchical clustering of the clusters based on linkage we simply run:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/ClusterLinkNOverlap.pl --cfile=concoct-output/clustering_gt1000.csv --lfile=concoct-input/concoct_linkage.tsv --covfile=concoct-input/concoct_inputtableR.tsv --ofile=concoct-output/clustering_gt1000_l.csv
0.431034482758621
4 2->2 3->3 0.431034482758621 58 0.948501147569037 0.0303217991829503 0.0235015008038782
3
D0,1,E0,D0
D1,1,E1,D1
D3,2,E2,D3,D2
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$
The output indicates that the clusters have been reduced from four to three. The new clustering is given by concoct-output/clustering_gt1000_l.csv
. This is a significant improvement in recall:
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ $CONCOCT/scripts/Validate.pl --cfile=concoct-output/clustering_gt1000_l.csv --sfile=evaluation-output/clustering_gt1000_s.csv --ofile=evaluation-output/clustering_gt1000_conf.csv
N M TL S K Rec. Prec. NMI Rand AdjRand
684 684 6.8400e+02 5 3 1.000000 0.997076 0.991062 0.998416 0.996832
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$
Finally, destroy the bash
shell to revert back to your normal python
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ exit
exit
[MScBioinf@quince-srv2 ~/CTutorial_Umer]$ python -V
Python 2.6.6
Last Updated by Dr Umer Zeeshan Ijaz on 4/2/2015.