Software for Pyrosequencing Noise Removal

This page and the associated software are no longer supported. Instead, an improved version of PyroNoise called AmpliconNoise, which includes a chimera checker and much more sensitive noise removal, is available from the Google Code project AmpliconNoise. An associated paper will appear soon.

This web page contains links to the source code of the diversity estimation software used in Quince et al. "Noise and the Accurate Determination of Microbial Diversity from 454 Pyrosequencing Data". This code is made available without warranty for non-commercial use only (see copyright notice below). Any publications resulting from applications of this software should cite the above article. If you find this software useful, or would like help installing or using it, please e-mail me: Chris Quince.


Click the following links to download a zip file containing data files and source code:

Update containing a bug fix for PCluster that prevents segmentation faults on some machines (due to Jonas Paulsen) and some extra command line parameters for PCluster and FDist that allow you to override the hard-coded LookUp file location (suggested by Jens Reeder). There is also an improved Perl script for generating the .dat files (usage ./ primer outstub key < your.sff.txt).

The original code from the paper


The software currently only runs on Linux computers with MPI. To install the software:


The MPI executables read the GSFLX lookup file "../Data/LookUp.dat". The hard-coded reference to this file can be changed in the headers (LOOKUP_FILE); otherwise FDist and PCluster must be run from their own directories, the Examples directory, or the bin directory so that this relative path still resolves. The starting point for de-noising data is a .dat file of flowgrams; an example, "PriestPot_C7.dat", is given in the Examples directory. To process this file move to the Examples directory...
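The path constraint described above can be checked before launching a run. The following is a minimal sketch in Python, not part of the PyroNoise distribution: the helper name lookup_visible_from is hypothetical, and the only thing taken from the text is the relative path "../Data/LookUp.dat", which is assumed to be resolved against the directory the executable is started from.

```python
from pathlib import Path

# Relative path quoted in the text above. It is assumed here that the
# executables resolve it against the current working directory.
LOOKUP_RELATIVE = Path("../Data/LookUp.dat")


def lookup_visible_from(run_dir):
    """Return True if ../Data/LookUp.dat exists relative to run_dir.

    Hypothetical helper for illustration only; it does not read or
    validate the contents of the lookup file.
    """
    return (Path(run_dir) / LOOKUP_RELATIVE).is_file()
```

Calling lookup_visible_from("Examples") from the top of the unpacked archive should report whether a run started in the Examples directory would find the lookup file, assuming the Data directory sits alongside it.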

Perl Scripts and Extracting Flowgrams from Sff Files

The bin directory contains Perl scripts that are used below for tackling larger data sets. They are also needed to extract flowgrams from sff files. As an example, given the ArtificialGSFLX.sff file downloaded from here, we would run:

Auxiliary Software

Two further executables (NDist and FastaUnique) need to be compiled and installed to tackle larger data sets. As above, move to each of the two program directories (NDist, FastaUnique) with the command "cd ../$ProgramDirectory" and type make (after changing the makefile compiler flags). This will generate one MPI executable (NDist) and one regular executable (FastaUnique). Copy the executables to the bin directory, which should already be on your path.

Running Larger Data Sets

PCluster can currently handle a maximum of 10,000 flowgrams. Larger data sets risk numerical overflow errors and exhaust the memory typically available (~8 GB). To deal with larger data sets, two strategies can be employed; they are not mutually exclusive. The first is an initial clustering of flowgrams based on sequence. As an example, I will show how this was done for the 'divergent sequences' test data set. The following assumes all executables have been copied to the bin directory and that this directory is on your path.
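Whichever strategy is used, each group of flowgrams passed to PCluster has to stay within the limit quoted above. The following is a minimal sketch in Python of that bookkeeping, assuming an initial clustering has already assigned every flowgram a cluster label. The function name oversized_clusters and the (flowgram_id, cluster_label) input format are hypothetical and do not correspond to any file format or script in the distribution.

```python
from collections import defaultdict

# Maximum number of flowgrams PCluster can handle, as stated in the text.
MAX_FLOWGRAMS = 10_000


def oversized_clusters(assignments, limit=MAX_FLOWGRAMS):
    """Return {cluster_label: size} for clusters exceeding the limit.

    assignments is an iterable of (flowgram_id, cluster_label) pairs.
    This input format is an assumption made for illustration only.
    """
    sizes = defaultdict(int)
    for _flowgram_id, label in assignments:
        sizes[label] += 1
    # Keep only the clusters that are too large to hand to PCluster.
    return {label: n for label, n in sizes.items() if n > limit}
```

An empty result would indicate that every initial cluster is small enough to be de-noised separately; any labels returned would need to be split further first.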

Copyright and Lack of Warranty

This software and documentation is copyright © 2009 by Christopher Quince.

Permission is granted for anyone to copy, use, or modify these programs and documents for purposes of research or education, provided this copyright notice is retained, and note is made of any changes that have been made.

These programs and documents are distributed without any warranty, express or implied. As the programs were written for research purposes only, they have not been tested to the degree that would be advisable in any important application. All use of these programs is entirely at the user's own risk.