Metaproteomics Data Analysis Workflow

by Umer Zeeshan Ijaz, Florence Abram, Aoife Joyce, and Chris Quince

This tutorial gives a step-by-step description of the metaproteomics data analysis workflow. We start off with ProteinPilot, a software tool for protein identification and quantitation that uses a hybrid sequence tag and database search approach using feature probabilities to identify hundreds of sequence variant in a single step for the data generated from mass spectrometers. The software uses peptide results, condenses it down to the most defensible set of detected proteins with ambiguity and reports their accession numbers (Uniprot) along with quality statistics on peptide and protein identification. The final results can be exported as an excel file:



In the above figure, assignment of peptide sequences to different proteins are shown in "Protein N" column in the first sheet. For example, the first protein contains 20 such sequences. The "Representative Accession" lists all the ambigious protein assignments identifiable Uniprot IDs. We have written the processPROT.py script that can read the Excel files generated from ProteinPilot, download the protein sequences and associated metadata from Swiss-Prot using ExPASy server, and filter out any proteins for which there is not enough supporting evidence (less than 2 peptides sequences assigned).
[uzi@quince-srv2 ~/Metaproteomics]$ ./processPROT.py -i Sample_1_a.xls -i Sample_1_b.xls -i Sample_1_c.xls > Sample_1.tsv  
You can use -i switch to provide as many Excel files as you have for a given sample, with the tab-delimited output file listing proteins with unique numeric identifiers with the format [Sample ID]_[Protein ID] as shown in the first column in the following figure. We output a 5-tuple record (ID, Uniprot Identifier, Name, Assignment to Organism, Taxonomy ID, Classification Path of the Organism, Protein Sequence). Columns with missing data have __NA__ value in them.



Since the last column contains the protein sequences, we can use the following one-liner to extract the unique protein sequences in the form of a FASTA file. Since, we are only interested in microbial communities, we also remove any "Eukaryota" entries from the table:
[uzi@quince-srv2 ~/Metaproteomics]$ grep -vi "Eukaryota" Sample_1.tsv | awk -F"\t" '$8!="__NA__"{print $8}' | sort | uniq | awk '{print ">PROT"++i"\n"$0}' > Sample_1.fasta
[uzi@quince-srv2 ~/Metaproteomics]$ head Sample_1.fasta

>PROT1
AAPAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAALAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT2
AAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT3
AKSKFERNKPHVNIGTIGHVDHGKTSLTAAITKYFGEFKAYDQIDAAPEEKARGITISTAHVEYETPNRHYAHVDCPGHADYVKNMITGAAQMDGAILVVSAADGPMPQTREHILLARQVGVPAIVVFLNKVDQVDDAELLELVELEVRELLSSYEFPGDDIPIVKGSALAALEDSDKKIGEDAIRELMAAVDAYIPTPERPIDQPFLMPIEDVFSISGRGTVVTGRVERGIVKVGEEIEIVGIRPTTKTTCTGVEMFRKLLDQGQAGDNIGALLRGVDRNGVERGQILCKPGSVKPHRKFKAEAYILTKEEGGRHTPFFTNYRPQFYFRTTDVTGIVTLPEGTEMVMPGDNVTVDVELIVPIAMEEKLRFAIREGGRTVGAGIVASIVE
>PROT4
APAAAAPAAAPAAQAAAPAAAGGDDAILSPAARKLAEEAGIDPNSIAGTGKGGRVTKEDVVAAVEAKKNAPAAPAKPAAPAAEAPIFAAGDRVEKRVPMTRLRAKVAERLVEAQSAMAMLTTFNEVNMKPIMDLRSKYKDLFEKKHNGVRLGFMSFFVKAATEALKRFPGVNASIDGNDIVYHGYQDIGVAVSSDRGLVVPVLRNAEFMSLAEIEGGIANFGKKAKEGKLTIEDMTGGTFTISNGGVFGSLLSTPIVNPPQTAILGMHKIQERPMAVNGQVVILPMMYLALSYDHRLIDGKEAVSFLVAIKDLLEDPARLLLDV
>PROT5
ATLLAQAIIREGMKNVTAGANPMVVRKGIQDAVDAAVEALHKNSKKVTCSEDIARVATISAGDETVGKLIAEAMEKVTSDGVITVEESKTSETYSEVVEGMMFDRGYISPYMATDTEKMEAIMDDAYLLITDKKISNIQEILPLLEKIVQSGKKLVIIAEDVEGEALTTLILNKLRGTFNCAAV
Next, we use RAPsearch to search our protein sequences against the KEGG database and get a tab-delimited blast output file (identified by *.out.m8 ending). To speed up RAPsearch, you can specify the number of cores to run on using -z switch.
[uzi@quince-srv2 ~/Metaproteomics]$ rapsearch -d /home/opt/keggs_database/kegg_db -z 2 -q Sample_1.fasta -o Sample_1.out
[uzi@quince-srv2 ~/Metaproteomics]$ head *.out.m8
# RAPSearch
# Job submitted: Thu Jul  3 17:57:51 2014
# Query : Sample_1.fasta
# Subject : /home/opt/keggs_database/kegg_db
# Fields: Query	Subject	identity	aln-len	mismatch	gap-openings	q.start	q.end	s.start	s.end	log(e-value)	bit-score
PROT1	psa:PST_1876	89.2308	325	35	0	1	325	84	408	-156.9	557.37
PROT1	pmy:Pmen_2502	84.7095	327	48	1	1	325	84	410	-149.71	533.49
PROT1	psb:Psyr_2010	78.7692	325	67	1	1	325	89	411	-137.65	493.43
PROT1	hch:HCH_04744	73.2143	336	74	4	2	325	80	411	-130.46	469.54
PROT1	sde:Sde_2105	71.6049	324	89	2	5	325	80	403	-124.55	449.9
We are only interested in finding the general function of microbial community. This is done by identifying enzymes in our dataset. For this purpose, we use a REST-style KEGG API to link KEGG entries (2nd column in the output file) to EC numbers:
[uzi@quince-srv2 ~/Metaproteomics]$ awk -F"\t" '!/#/{print $1","$2}' Sample_1.out.m8 | while IFS=$"," read -r -a mA; do echo -e "${mA[0]}\t$(curl -s http://rest.kegg.jp/link/ec/${mA[1]})"; done | awk '{print $1"\t"$3}' > ec_assignments.txt
[uzi@quince-srv2 ~/Metaproteomics]$ head ec_assignments.txt
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
PROT1	ec:2.3.1.61
RAPsearch reports multiple matches for a single protein sequence. We assign the most abundant EC number to each protein sequence:
[uzi@quince-srv2 ~/Metaproteomics]$ sort ec_assignments.txt | uniq -c | grep -i "ec:" | awk '{gsub("^ec:","",$3);if(ec[$2]){if($1>=c[$2]){ec[$2]=$3;c[$2]=$1}} else {ec[$2]=$3;c[$2]=$1}} END{for (i in ec) print i"\t"ec[i]}' > Sample_1.ec
[uzi@quince-srv2 ~/Metaproteomics]$ tail Sample_1.ec
PROT1	2.3.1.61
PROT78	4.2.1.11
PROT2	2.3.1.61
PROT136	1.1.1.86
PROT137	1.12.7.2
PROT4	2.3.1.61
PROT138	2.7.9.1
PROT139	3.2.1.23
PROT95	1.1.1.86
PROT9	3.6.3.14

We next construct parsimonious pathways using MinPath as it does not attempt to reconstruct entire pathways from a given set of protein sequences identified in a protemics experiment, but determines the minimal set of biological pathways that must exist in the biological system to explain the input protein sequences. It takes a set of reference pathways and a set of proteins that can be mapped to one or more pathways, and finds the minimum number of pathways that can explain all the protein functions. We will using MinPath
To use MinPath, you need to setup environmental variables in your ~/.bashrc file after downloading the software
# Required for minpath
export PATH=/home/opt/MinPath:$PATH
export MinPath=/home/opt/MinPath
MinPath comes with a mapping file ec2path that lists the reference pathways from MetaCyc and all the EC numbers associated with them. MetaCyc serves as an encyclopedia of metabolism containing more than 2151 patways from more than 2500 different organisms.
[uzi@quince-srv2 /home/opt/MinPath/data]$  head -50 ec2path
#Metacyc pathway and ec mapping file
#Pathway	EC
PWY0-1479	3.1.26.12
PWY0-1479	2.7.7.56
PWY0-1479	3.1.13.5
PWY0-1479	3.1.26.5
PWY0-1479	3.1.13.5
PWY0-1479	2.7.7.56
PWY0-1479	3.1.26.5
PWY0-1479	3.1.13.1
PWY0-1479	3.1.13.1
PWY0-1479	3.1.26.12
PWY-6146	4.1.1.31
PWY-6146	4.2.1.1
PWY-6146	6.4.1.1
PWY-6146	4.1.1.31
PWY-6146	2.7.1.40
PWY-6146	4.2.1.11
PWY-6146	5.4.2.1
PWY-6146	1.2.7.6
PWY-6146	5.3.1.1
PWY-6146	4.1.2.13
PWY-6146	2.7.1.146
PWY-6146	5.3.1.9
PWY-6146	1.2.7.1
PWY-6146	1.2.7.4
PWY-6146	2.1.1.-
PWY-6146	1.5.1.5
PWY-6146	3.5.4.9
PWY-6146	6.3.4.3
PWY-6146	1.2.1.43
PWY-6146	1.1.1.37
PWY-6146	1.3.99.1
PWY-6146	1.2.7.3
PWY-6146	6.2.1.5
PWY-6146	4.2.1.2
PWY-6146	6.4.1.1
PWY-6146	1.2.7.1
PWY-6146	2.6.1.1
PWY-6146	2.6.1.2
PWY-6146	1.1.1.261
PWY-6146	2.5.1.41
PWY-6146	2.5.1.42
PWY-6146	2.7.7.67
PWY-6146	2.5.1.29
PWY-6146	5.3.3.2
PWY-6146	2.5.1.10
PWY-6146	2.5.1.1
PWY-6146	5.3.3.2
PWY-6146	4.1.1.33
To construct pathways from the generated EC file using MetaCyc pathways (using default mapping file in MinPath), run it as follows:
[uzi@quince-srv2 ~/Metaproteomics]$ MinPath1.2.py -any Sample_1.ec -map ec2path -report Sample_1.ec.report -details Sample_1.ec.details
[uzi@quince-srv2 ~/Metaproteomics]$ head Sample_1.ec.report
path 2 any n/a  naive 1  minpath 0  fam0  39  fam-found  2  name  PWY-6146
path 4 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  BRANCHED-CHAIN-AA-SYN-PWY
path 11 any n/a  naive 1  minpath 1  fam0  9  fam-found  1  name  PWY-821
path 44 any n/a  naive 1  minpath 1  fam0  1  fam-found  1  name  GLUTSYNIII-PWY
path 46 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5505
path 49 any n/a  naive 1  minpath 1  fam0  8  fam-found  1  name  PWY-6549
path 53 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  THREOCAT-PWY
path 55 any n/a  naive 1  minpath 0  fam0  5  fam-found  1  name  ILEUSYN-PWY
path 56 any n/a  naive 1  minpath 0  fam0  11  fam-found  1  name  PWY-3001
path 57 any n/a  naive 1  minpath 0  fam0  6  fam-found  1  name  PWY-5101
The last column reports the pathways, which you can search on MetaCyc website to get the detailed information. For example, here is the record for BRANCHED-CHAIN-AA-SYN-PWY on the website:



However, we are interested in KEGG pathways so that we can draw metabolic networks using iPath. We, therefore, generate a new mapping file using the REST-style KEGG API:
[uzi@quince-srv2 /home/opt/MinPath/data]$ for i in $(curl -S http://rest.kegg.jp/list/pathway | awk -F"\t" '{print $1}'); do curl -S http://rest.kegg.jp/link/ec/$i; done > ec2kegg
[uzi@quince-srv2 /home/opt/MinPath/data]$ awk -F"\t" '!/^$/{gsub("^path:","",$1);gsub("^ec:","",$2);print $1"\t"$2}' ec2kegg > ec2kegg_final
[uzi@quince-srv2 /home/opt/MinPath/data]$ chmod +x ec2kegg_final
[uzi@quince-srv2 /home/opt/MinPath/data]$ head -50 ec2kegg_final
map00010	1.1.1.1
map00010	1.1.1.2
map00010	1.1.1.27
map00010	1.1.2.7
map00010	1.1.2.8
map00010	1.2.1.12
map00010	1.2.1.3
map00010	1.2.1.5
map00010	1.2.1.59
map00010	1.2.1.9
map00010	1.2.4.1
map00010	1.2.7.1
map00010	1.2.7.6
map00010	1.8.1.4
map00010	2.3.1.12
map00010	2.7.1.1
map00010	2.7.1.11
map00010	2.7.1.146
map00010	2.7.1.147
map00010	2.7.1.2
map00010	2.7.1.40
map00010	2.7.1.41
map00010	2.7.1.63
map00010	2.7.1.69
map00010	2.7.2.3
map00010	3.1.3.10
map00010	3.1.3.11
map00010	3.1.3.13
map00010	3.1.3.9
map00010	3.2.1.86
map00010	4.1.1.1
map00010	4.1.1.32
map00010	4.1.1.49
map00010	4.1.2.13
map00010	4.2.1.11
map00010	5.1.3.15
map00010	5.1.3.3
map00010	5.3.1.1
map00010	5.3.1.9
map00010	5.4.2.11
map00010	5.4.2.12
map00010	5.4.2.2
map00010	5.4.2.4
map00010	6.2.1.1
map00010	6.2.1.13
map00020	1.1.1.286
map00020	1.1.1.37
map00020	1.1.1.41
map00020	1.1.1.42
map00020	1.1.5.4
Now run MinPath1.2.py with the new mapping file using -map ec2kegg_final in the command:
[uzi@quince-srv2 ~/Metaproteomics]$ MinPath1.2.py -any Sample_1.ec -map ec2kegg_final -report Sample_1.ec.report.kegg -details Sample_1.ec.details.kegg
[uzi@quince-srv2 ~/Metaproteomics]$ head Sample_1.ec.report.kegg
path 1 any n/a  naive 1  minpath 0  fam0  45  fam-found  3  name  map00010
path 2 any n/a  naive 1  minpath 0  fam0  24  fam-found  1  name  map00020
path 3 any n/a  naive 1  minpath 0  fam0  50  fam-found  1  name  map00030
path 4 any n/a  naive 1  minpath 0  fam0  63  fam-found  1  name  map00040
path 5 any n/a  naive 1  minpath 0  fam0  70  fam-found  2  name  map00051
path 6 any n/a  naive 1  minpath 0  fam0  46  fam-found  1  name  map00052
path 9 any n/a  naive 1  minpath 0  fam0  13  fam-found  1  name  map00062
path 10 any n/a  naive 1  minpath 0  fam0  30  fam-found  2  name  map00071
path 18 any n/a  naive 1  minpath 0  fam0  11  fam-found  2  name  map00190
path 19 any n/a  naive 1  minpath 0  fam0  3  fam-found  1  name  map00195
To get the description of the reported KEGG pathways, use the following one-liner:
[uzi@quince-srv2 ~/Metaproteomics]$ for i in $(awk '{print $14}' Sample_1.ec.report.kegg); do curl -S http://rest.kegg.jp/find/pathway/$i; done
path:map00010	Glycolysis / Gluconeogenesis
path:map00020	Citrate cycle (TCA cycle)
path:map00030	Pentose phosphate pathway
path:map00040	Pentose and glucuronate interconversions
path:map00051	Fructose and mannose metabolism
path:map00052	Galactose metabolism
path:map00062	Fatty acid elongation
path:map00071	Fatty acid degradation
path:map00190	Oxidative phosphorylation
path:map00195	Photosynthesis
path:map00230	Purine metabolism
path:map00240	Pyrimidine metabolism
path:map00250	Alanine, aspartate and glutamate metabolism
path:map00280	Valine, leucine and isoleucine degradation
path:map00281	Geraniol degradation
path:map00290	Valine, leucine and isoleucine biosynthesis
path:map00310	Lysine degradation
path:map00330	Arginine and proline metabolism
path:map00360	Phenylalanine metabolism
path:map00362	Benzoate degradation
path:map00380	Tryptophan metabolism
path:map00410	beta-Alanine metabolism
path:map00511	Other glycan degradation
path:map00531	Glycosaminoglycan degradation
path:map00592	alpha-Linolenic acid metabolism
path:map00600	Sphingolipid metabolism
path:map00604	Glycosphingolipid biosynthesis - ganglio series
path:map00620	Pyruvate metabolism
path:map00627	Aminobenzoate degradation
path:map00630	Glyoxylate and dicarboxylate metabolism
path:map00640	Propanoate metabolism
path:map00650	Butanoate metabolism
path:map00680	Methane metabolism
path:map00710	Carbon fixation in photosynthetic organisms
path:map00720	Carbon fixation pathways in prokaryotes
path:map00730	Thiamine metabolism
path:map00770	Pantothenate and CoA biosynthesis
path:map00903	Limonene and pinene degradation
path:map00910	Nitrogen metabolism
path:map00920	Sulfur metabolism
path:map00930	Caprolactam degradation
path:map01040	Biosynthesis of unsaturated fatty acids
path:map01100	Metabolic pathways
path:map01110	Biosynthesis of secondary metabolites
path:map01120	Microbial metabolism in diverse environments
We want to use interactive Pathways Explorer (iPath), which is a web-based tool for the visualization, analysis and customization of the various pathways maps. The metabolic pathways are constructed using 146 KEGG pathways, and give an overview of the complete metabolism in biological systems. We first extract the pathways in the format to be used with iPath and give each of them red color:
[uzi@quince-srv2 ~/Metaproteomics]$ awk '{gsub("^map","",$14);print $14" #ff0000"}' Sample_1.ec.report.kegg > Sample_1.ec.report.kegg.ipath
Now go to http://pathways.embl.de/index.html and click on iPath v2 interface to load the applet:



Next click on the "Customize" button to load the "Customization and data mapping" tab, and upload Sample_1.ec.report.kegg.ipath by clicking on "Select file" button. This will populate the "Element selection" text area. You can manually assign different colors to your pathways here. Finally, click on "Submit data and customize maps" button to generate your metabolic network.



Once generated, you can save the metabolic network as a PDF document to your local computer. The connections in red are the ones present in your sample:


You can then repeat the same analysis on your other samples.
Last Updated by Dr Umer Zeeshan Ijaz on 05/07/2014.