Software for Pyrosequencing Noise Removal
This page and the associated software are no longer supported. An improved version of PyroNoise, AmpliconNoise, which includes a chimera checker and much more sensitive noise removal, is available from the Google Code project AmpliconNoise. An associated paper will appear soon.
This web page contains links to the source code of the diversity estimation software used in Quince et al. "Noise and the Accurate Determination of Microbial Diversity from 454 Pyrosequencing Data". This code is made available without warranty, for non-commercial use only (see copyright notice below). Any publications resulting from applications of this software should cite the above article. If you find this software useful, or would like some help installing or using it, please e-mail me: Chris Quince.
Click the following links to download a zip file containing data files and source code:
PyroNoise2.zip An update containing a bug fix for PCluster that prevents segmentation faults on some machines (due to Jonas Paulsen), plus extra command-line parameters for PCluster and FDist that let you override the hard-coded LookUp file location (suggested by Jens Reeder). There is also an improved Perl script for generating the .dat files, FlowsF2.pl (usage: ./FlowsF2.pl primer outstub key < your.sff.txt).
The original code from the paper: PyroNoise.zip.
The software currently only runs on Linux computers with MPI. To install the software:
- Move the downloaded file to where you want to create the programs and type "unzip PyroNoise.zip". This will unpack the source code into a directory tree.
- Move into each of the three program directories (FDist, QCluster, PCluster) in turn with the command "cd ../$ProgramDirectory" and type make (after adjusting the makefile compiler flags). This will generate two MPI executables (FDist, PCluster) and one regular executable (QCluster). Copy the executables to the bin directory and then add the bin directory to your path.
The MPI executables read the GSFLX lookup file "../Data/LookUp.dat". The hard-coded reference to this file can be changed in the headers (LOOKUP_FILE); otherwise FDist and PCluster must be run from their own directories, the Examples directory, or the bin directory to preserve this relative path. The starting point for de-noising data is a .dat file of flowgrams; an example, "PriestPot_C7.dat", is given in the Examples directory. To process this file, move to the Examples directory:
- The .dat file format is very simple:
The first line has the number of flowgrams followed by the number of flows: N M
Each of the N flowgram entries has the format: id length flow1 flow2 ... flowM
where id is an integer identifier, length is the number of 'clean' flows, and flow1 ... flowM
are all M flow values (although only the first length of them will ever be used).
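A .dat file can be sanity-checked against this format with a short awk one-liner. The toy file below, and all of its values, are invented purely for illustration:

```shell
# Create a toy .dat file: 2 flowgrams, 4 flows each (values are made up).
cat > example.dat <<'EOF'
2 4
0 3 1.02 0.05 0.98 2.01
1 4 0.97 1.04 0.03 1.00
EOF

# Every flowgram line should have id + length + M flow values, i.e. M+2 fields.
awk 'NR == 1 { n = $1; m = $2; next }
     NF != m + 2 { bad++ }
     END { printf "flowgrams=%d flows=%d bad_lines=%d\n", n, m, bad + 0 }' example.dat
```

Running this on the toy file reports 2 flowgrams of 4 flows with no malformed lines.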
- run FDist to get initial distances between flowgrams:
mpirun -np 4 FDist -in PriestPot_C7.dat -out PriestPot_C7
(The algorithm used here is slightly different from that detailed in the supplement of
Quince et al. 2009: rather than summing over homopolymer lengths with a prior, only the
length that maximises the likelihood is used. This does not alter the test results and is
more consistent with the PCluster algorithm. The original algorithm is available on request.)
- run QCluster to perform complete linkage hierarchical clustering with these distances:
QCluster -in PriestPot_C7.fdist -out PriestPot_C7
(QCluster generates various files, of which the most useful is a .list file with the
clusters given at 0.01 increments; the increment can be changed with the -r option.)
- run PCluster to de-noise:
mpirun -np 4 PCluster -din PriestPot_C7.dat -out PriestPot_C7 -lin PriestPot_C7.list -s 15.0 -c 0.05 > PriestPot_C7.pout
(This is the actual flowgram clustering program. It uses the .list file from QCluster as its
initial conditions; the cut-off for these conditions is controlled by -c. The -s option is
the reciprocal of the cluster size, i.e. 1/\sigma in the supplement, so a smaller -s gives tighter clustering. Many output files are produced. The most important is a fasta file of de-noised sequences, $out_cd.fa. Each entry has a frequency N appended to the fasta id in the format >SeqName_N; this frequency is the number of flowgrams mapping to that sequence. A directory $out will also be created containing, for each clean sequence, a fasta file of the reads that map to it. These are very useful for checking how pyrosequencing and PCR point mutations are being removed.)
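Since each denoised sequence's frequency is encoded after the final underscore of its fasta id, the total number of reads accounted for can be recovered with grep and awk. The file below is a made-up stand-in for a real $out_cd.fa:

```shell
# Toy stand-in for a denoised fasta file (names, counts and sequences invented).
cat > Example_cd.fa <<'EOF'
>Seq0_12
ACGTACGTACGT
>Seq1_3
ACGTTCGTACGT
EOF

# Sum the _N suffixes of the fasta ids to count the reads mapped to clean sequences.
grep '^>' Example_cd.fa | awk -F_ '{ total += $NF } END { print total }'
```

For the toy file this prints 15, the sum of the two per-sequence frequencies.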
Perl Scripts and Extracting Flowgrams from Sff Files
The bin directory contains Perl scripts that are used below for tackling larger data sets. They are also necessary to extract flowgrams from sff files. As an example, given the ArtificialGSFLX.sff file downloaded from here, we would run:
- To generate text translation of sff file:
sffinfo ArtificialGSFLX.sff > ArtificialGSFLX.sff.txt
(sffinfo is an executable available from 454)
- To parse out clean flowgrams of length >= 200 and with primer:
FlowsF.pl < ArtificialGSFLX.sff.txt > ArtificialGSFLXClean.dat
(This script contains a hardcoded reference to the primer we used; change it as appropriate)
- To get clean sequences:
SequenceF.pl < ArtificialGSFLX.sff.txt > ArtificialGSFLXClean.fa
(This script too contains a hardcoded reference to a primer)
- To split a sff text file on a set of barcode keys given in file "Primers.csv":
SplitKeys.pl < Barcodes.sff.txt > KeyStats.txt
(This will generate a set of key.dat files)
- These files then need to be cleaned:
Clean.pl < key.dat > key.cdat
(The .cdat files can then be used as input to FDist, PCluster etc., EXCEPT that the flag -ni must be added to both these programs to indicate that no integer index is present in these files. The same is true if the .dat files from FlowsF.pl are used directly, without the initial sequence clustering.)
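If you would rather not pass -ni, one workaround (assuming the -ni format is simply the standard .dat format minus the leading id field, per the format description above) is to prepend integer indices yourself. The toy .cdat file here is invented:

```shell
# Toy .cdat file: header line, then flowgram lines without the leading id.
cat > key.cdat <<'EOF'
2 4
4 1.02 0.05 0.98 2.01
3 0.97 1.04 0.03 1.00
EOF

# Prepend a 0-based integer id to every flowgram line; leave the header alone.
awk 'NR == 1 { print; next } { print NR - 2, $0 }' key.cdat > key_indexed.dat
cat key_indexed.dat
```

The result has the same header but each flowgram line now starts with 0, 1, ....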
Two further executables need to be compiled and installed to tackle larger data sets (NDist, FastaUnique). As above, move into the two program directories (NDist, FastaUnique) with the command "cd ../$ProgramDirectory" and type make (after adjusting the makefile compiler flags). This will generate one MPI executable (NDist) and one regular executable (FastaUnique). Copy the executables to the bin directory, which should already be on your path.
Running Larger Data Sets
PCluster can currently handle a maximum of 10,000 flowgrams. Larger data sets risk
numerical overflow and exhaust typically available memory (~8GB).
To deal with larger data sets, two complementary (not mutually exclusive) strategies can be
employed. The first is an initial clustering of flowgrams based on sequence. As an example,
I will show how this was done for the 'divergent sequences' test data set. The following assumes all executables have been added to bin on your path.
- Begin by parsing off the primer and truncating sequences to a length of 100:
sed 's/^ATTAGATACCC[ACTG]GGTAG//' ArtificialGSFLXClean.fa | Truncate.pl 100 > ArtificialGSFLXClean_T.fa
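The effect of this step can be seen on a toy record. The read below is invented, and an awk substr stands in for Truncate.pl, whose exact behaviour it only approximates:

```shell
# Toy fasta record whose sequence starts with the primer used above.
cat > toy.fa <<'EOF'
>read0
ATTAGATACCCCGGTAGACGTACGTACGT
EOF

# Strip the primer from the start of sequence lines, then truncate each
# sequence to its first 100 bases (awk substr standing in for Truncate.pl).
sed 's/^ATTAGATACCC[ACTG]GGTAG//' toy.fa |
    awk '/^>/ { print; next } { print substr($0, 1, 100) }'
```

The header line is untouched; the sequence loses its 17-base primer prefix and keeps at most 100 of the remaining bases.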
- Use application FastaUnique to generate a fasta file of unique sequences only:
FastaUnique -in ArtificialGSFLXClean_T.fa > ArtificialGSFLXClean_U.fa
(this also generates a mapping of unique truncated sequences to positions in the original fasta file)
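The deduplication FastaUnique performs can be sketched with awk on a toy file of two-line fasta records (file name, ids and sequences invented); the real program additionally writes the .map file used below:

```shell
# Toy fasta with a duplicate sequence (two-line records assumed).
cat > toyuniq.fa <<'EOF'
>r0
ACGTACGT
>r1
ACGTACGT
>r2
TTTTCCCC
EOF

# Keep only the first occurrence of each distinct sequence.
awk '/^>/ { id = $0; next } !seen[$0]++ { print id; print $0 }' toyuniq.fa
```

Here the duplicate read r1 is dropped, leaving two unique sequences.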
- Get distances based on a Needleman-Wunsch global alignment of unique sequences using the NDist MPI program:
mpirun -np 32 NDist -in ArtificialGSFLXClean_U.fa > ArtificialGSFLXClean_U.ndist
(obviously adjust the -np argument to the number of processors available on your cluster. This protocol differs slightly from that used in Quince et al. 2009 in that the pairwise alignments allow us to avoid doing a multiple alignment.)
- Perform complete linkage clustering on those distances:
QCluster -in ArtificialGSFLXClean_U.ndist -out ArtificialGSFLXClean_U
- Map the original fasta sequences into this clustering:
FillMap.pl ArtificialGSFLXClean_T.map ArtificialGSFLXClean_U.list > ArtificialGSFLXClean.list
- Separate the .dat file using this clustering at 20% sequence divergence:
ClusterDist.pl ArtificialGSFLXClean.dat ArtificialGSFLXClean.list 0.20
(smaller or larger cut-offs can be used to get clusters of the right size < 10,000 flowgrams)
- We now have directories C0, ..., CN (and sometimes C(N+1)+, a concatenation of small clusters) containing the clustered .dat files; these are de-noised
individually as above. The following lines in a bash shell script will run FDist on the
complete data set:
for c in C* ; do
    if [ -d $c ] ; then
        mpirun -np 32 FDist -in $c/$c.dat -out $c/$c > $c/$c.fout
    fi
done
- These are then clustered:
for c in C* ; do
    if [ -d $c ] ; then
        QCluster -in $c/$c.fdist -out $c/$c > $c/$c.qout
    fi
done
- And de-noised:
for c in C* ; do
    if [ -d $c ] ; then
        mpirun -np 32 PCluster -din $c/$c.dat -out $c/$c -lin $c/$c.list -s 15.0 -c 0.05 > $c/$c.pout
    fi
done
(The -s and -c parameters may need to be adjusted to get the best results)
- The de-noised fasta files can now be concatenated to give the final result:
cat C*/*_cd.fa > ArtificialGSFLXDenoised.fa
- The last step is to remove chimeras; software for this will appear soon...
Copyright and Lack of Warranty
This software and documentation is copyright © 2009 by Christopher Quince.
Permission is granted for anyone to copy, use, or modify these programs and documents for purposes of research or education, provided this copyright notice is retained, and note is made of any changes that have been made.
These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user's own risk.