Scripts for Annotation

by Umer Zeeshan Ijaz and Chris Quince

Bash one-liners for extracting enzyme information from annotated GBK files
Script for extracting/drawing contigs from the GBK file. These contigs contain the enzymes we are interested in
Script for extracting all known enzyme sequences from Uniprot databases
Script for finding motifs within enzyme sequences
Scripts for searching CDS regions extracted from PROKKA against NCBI's CDD databases

Bash one-liners for extracting enzyme information from annotated GBK files

We will use rest-style KEGG API to resolve enzyme details in genbank files. The API uses KEGG ENZYME database, which is an implementation of the Enzyme Nomenclature (EC Numbers) on the ExplorEnz database, and is maintained in the KEGG LIGAND relational database with additional annotation of reaction hierarchy, organism information, and sequence data links.
To use these one-liners on your genbank files, replace test.gbk with the name of the file you are using. First step is to extract a tab-separated list of only those contigs which have enzymes in them (all other contigs are ignored):
[uzi@quince-srv2 ~/annotation]$ awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","ec:\\1","g")}/^LOCUS/{print ";"$2}'  test.gbk | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}'
NODE_2925_length_923_cov_1.698808	ec:4.2.1.17	
NODE_4520_length_55408_cov_7.328635	ec:1.11.1.6	ec:1.2.1.22	ec:1.1.99.1	ec:2.5.1.18	
NODE_6043_length_823_cov_1.500607	ec:6.2.1.25	
NODE_918_length_270_cov_10.744445	ec:1.5.3.1	
NODE_312_length_2459_cov_13.763318	ec:4.3.1.17	
NODE_282_length_4935_cov_13.040932	ec:2.7.6.5	ec:3.1.7.2	ec:2.7.7.6	ec:2.7.4.8	
NODE_3153_length_49002_cov_8.110995	ec:1.4.3.19	ec:4.1.3.39	ec:5.1.1.8	ec:3.5.4.22	ec:1.2.1.26	ec:1.1.1.46	ec:1.1.1.1	ec:4.1.1.28	
NODE_241_length_4648_cov_13.109509	ec:4.2.1.20	
NODE_533_length_2973_cov_13.546249	ec:4.2.1.20	
NODE_2986_length_180_cov_3.444444	ec:1.2.1.38	
NODE_2957_length_228_cov_5.881579	ec:4.2.1.52	
NODE_2313_length_214_cov_3.420561	ec:1.14.11.17	
NODE_2177_length_8710_cov_10.954764	ec:4.1.1.36	ec:6.3.2.5	ec:3.6.1.23	ec:5.4.2.8	ec:5.4.2.2	ec:2.7.2.8	ec:2.4.2.10	
NODE_2125_length_194_cov_3.505155	ec:1.2.1.27	
NODE_195_length_3240_cov_14.628086	ec:4.2.1.75	ec:4.2.1.75	
NODE_395_length_847_cov_14.246754	ec:2.1.1.45	
NODE_263_length_1894_cov_12.784055	ec:1.8.4.11	
NODE_2068_length_256_cov_5.703125	ec:6.3.2.10	
NODE_5094_length_82060_cov_7.647209	ec:5.2.1.8	ec:3.4.21.92	ec:2.5.1.54	ec:3.1.1.3	ec:5.2.1.8	ec:1.1.1.30	
NODE_1254_length_19746_cov_10.331156	ec:1.6.5.5	ec:1.5.1.29	
NODE_4044_length_223_cov_4.789237	ec:1.1.99.1	
NODE_2933_length_5310_cov_10.508098	ec:3.4.13.19	
NODE_2296_length_260_cov_4.046154	ec:2.7.1.24	
NODE_618_length_171_cov_3.111111	ec:2.3.1.31	
NODE_4435_length_1843_cov_2.593597	ec:6.1.1.9	
NODE_1478_length_868_cov_12.869816	ec:2.4.2.17	
NODE_518_length_1838_cov_14.368879	ec:2.1.1.176	ec:2.1.2.9	
NODE_4252_length_45804_cov_8.195944	ec:1.11.1.6	ec:1.11.1.7	ec:2.7.1.23	ec:6.2.1.3	ec:4.1.1.18	ec:2.7.7.7	ec:3.1.26.4	
NODE_1912_length_207_cov_4.357488	ec:2.1.1.13	
NODE_711_length_662_cov_13.104230	ec:2.7.4.7	
NODE_2276_length_2332_cov_2.129074	ec:1.1.1.25	
NODE_4612_length_2770_cov_2.385560	ec:4.6.1.1	
NODE_4102_length_25349_cov_10.313819	ec:2.7.7.6	ec:6.3.1.8	ec:1.11.1.6	
NODE_2532_length_27470_cov_9.829596	ec:4.3.1.12	ec:4.3.1.12	
NODE_2478_length_31408_cov_9.062372	ec:4.2.1.70	
NODE_1699_length_5660_cov_10.990990	ec:1.5.1.29	
NODE_153_length_1936_cov_12.737087	ec:1.5.1.3	
NODE_3727_length_13726_cov_9.037520	ec:1.1.5.2	
NODE_1207_length_2434_cov_10.899754	ec:3.1.26.5	
NODE_1695_length_238_cov_2.945378	ec:5.1.3.2	
NODE_4703_length_3098_cov_2.432537	ec:3.2.2.20	ec:6.1.1.14	
NODE_3161_length_31677_cov_10.626669	ec:1.2.1.3	ec:4.3.1.7	ec:4.3.1.7	
NODE_87_length_2776_cov_13.340418	ec:2.7.7.49	
NODE_495_length_2327_cov_12.959604	ec:2.5.1.18	
NODE_1135_length_515_cov_12.755340	ec:6.3.2.1	
NODE_2947_length_376_cov_14.313829	ec:1.4.1.13	
NODE_899_length_1304_cov_13.075920	ec:3.5.1.10	
NODE_4432_length_21460_cov_9.932059	ec:3.4.16.4	ec:5.4.3.8	ec:3.4.17.13	
NODE_5187_length_772_cov_2.908031	ec:2.7.6.5	ec:3.1.7.2	
NODE_5103_length_4544_cov_1.866417	ec:1.6.1.2	ec:1.6.1.2	
NODE_4340_length_1843_cov_2.208356	ec:2.1.2.11	
NODE_1053_length_980_cov_13.608163	ec:1.2.1.2	
NODE_3523_length_5106_cov_10.867215	ec:4.2.1.1	ec:3.5.4.16	
NODE_258_length_1705_cov_13.052199	ec:1.2.1.26	
...
The following one-liner extracts the list of all enzymes found in the genbank file and uses rest-style KEGG API to generate names from EC numbers:
[uzi@quince-srv2 ~/annotation]$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/find/enzyme/$i) | awk '{$1=$1"\t"}1'; done
ec:1.1.1.108	 carnitine 3-dehydrogenase
ec:1.11.1.15	 peroxiredoxin; thioredoxin peroxidase; tryparedoxin peroxidase; alkyl hydroperoxide reductase C22; AhpC; TrxPx; TXNPx; Prx; PRDX
ec:1.1.1.158	 Transferred to 1.3.1.98
ec:1.11.1.6	 catalase; equilase; caperase; optidase; catalase-peroxidase; CAT
ec:1.1.1.169	 2-dehydropantoate 2-reductase; 2-oxopantoate reductase; 2-ketopantoate reductase; 2-ketopantoic acid reductase; ketopantoate reductase; ketopantoic acid reductase
ec:1.11.1.7	 peroxidase; lactoperoxidase; guaiacol peroxidase; plant peroxidase; Japanese radish peroxidase; horseradish peroxidase (HRP); soybean peroxidase (SBP); extensin peroxidase; heme peroxidase; oxyperoxidase; protoheme peroxidase; pyrocatechol peroxidase; scopoletin peroxidase; Coprinus cinereus peroxidase; Arthromyces ramosus peroxidase
ec:1.11.1.9	 glutathione peroxidase; GSH peroxidase; selenium-glutathione peroxidase; reduced glutathione peroxidase
ec:1.1.1.205	 IMP dehydrogenase; inosine-5'-phosphate dehydrogenase; inosinic acid dehydrogenase; inosinate dehydrogenase; inosine 5'-monophosphate dehydrogenase; inosine monophosphate dehydrogenase; IMP oxidoreductase; inosine monophosphate oxidoreductase ec:1.2.1.14 Transferred to 1.1.1.205
ec:1.1.1.215	 gluconate 2-dehydrogenase; 2-keto-D-gluconate reductase; 2-ketogluconate reductase
ec:1.1.1.262	 4-hydroxythreonine-4-phosphate dehydrogenase; NAD+-dependent threonine 4-phosphate dehydrogenase; L-threonine 4-phosphate dehydrogenase; 4-(phosphohydroxy)-L-threonine dehydrogenase; PdxA; 4-(phosphonooxy)-L-threonine:NAD+ oxidoreductase
ec:1.1.1.274	 2,5-didehydrogluconate reductase (2-dehydro-D-gluconate-forming); 2,5-diketo-D-gluconate reductase (ambiguous)
ec:1.1.1.284	 S-(hydroxymethyl)glutathione dehydrogenase; NAD-linked formaldehyde dehydrogenase (incorrect); formaldehyde dehydrogenase (incorrect); formic dehydrogenase (incorrect); class III alcohol dehydrogenase; ADH3; chi-ADH; FDH (incorrect); formaldehyde dehydrogenase (glutathione) (incorrect); GS-FDH (incorrect); glutathione-dependent formaldehyde dehydrogenase (incorrect); NAD-dependent formaldehyde dehydrogenase; GD-FALDH; NAD- and glutathione-dependent formaldehyde dehydrogenase
ec:1.1.1.290	 4-phosphoerythronate dehydrogenase; PdxB; PdxB 4PE dehydrogenase; 4-O-phosphoerythronate dehydrogenase
ec:1.1.1.46	 L-arabinose 1-dehydrogenase
ec:1.1.1.47	 glucose 1-dehydrogenase [NAD(P)+]; D-glucose dehydrogenase (NAD(P)+); hexose phosphate dehydrogenase; beta-D-glucose:NAD(P)+ 1-oxidoreductase; glucose 1-dehydrogenase
...
For the extracted enzymes, we can list all the KEGG Ortholog (KO) groups each enzyme is part of, along with their detailed description. The KO system is the basis for representation for all proteins and functional RNAs that correspond to KEGG pathway nodes, BRITE hierarchy nodes, and KEGG module nodes:
[uzi@quince-srv2 ~/annotation]$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/ko/ec:$i | grep -Po '(?<=ko:).*' | sed -e "s/K/ko:K/g") | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/ko/{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	ko:K00411	UQCRFS1, RIP1, petA; ubiquinol-cytochrome c reductase iron-sulfur subunit [EC:1.10.2.2]
ec:1.1.1.1	ko:K00001	E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K00121	frmA, ADH5, adhC; S-(hydroxymethyl)glutathione dehydrogenase / alcohol dehydrogenase [EC:1.1.1.284 1.1.1.1]
ec:1.1.1.1	ko:K04072	adhE; acetaldehyde dehydrogenase / alcohol dehydrogenase [EC:1.2.1.10 1.1.1.1]
ec:1.1.1.1	ko:K11440	gbsB; choline dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K13951	ADH1_7; alcohol dehydrogenase 1/7 [EC:1.1.1.1]
ec:1.1.1.1	ko:K13952	ADH6; alcohol dehydrogenase 6 [EC:1.1.1.1]
ec:1.1.1.1	ko:K13953	adhP; alcohol dehydrogenase, propanol-preferring [EC:1.1.1.1]
ec:1.1.1.1	ko:K13954	yiaY; alcohol dehydrogenase [EC:1.1.1.1]
ec:1.1.1.1	ko:K13980	ADH4; alcohol dehydrogenase 4 [EC:1.1.1.1]
ec:1.1.1.100	ko:K00059	fabG; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]
ec:1.1.1.100	ko:K11610	mabA; beta-ketoacyl ACP reductase [EC:1.1.1.100]
ec:1.1.1.108	ko:K17735	lcdH, cdhA; carnitine 3-dehydrogenase [EC:1.1.1.108]
ec:1.11.1.15	ko:K03386	E1.11.1.15, PRDX, ahpC; peroxiredoxin (alkyl hydroperoxide reductase subunit C) [EC:1.11.1.15]
ec:1.11.1.15	ko:K03564	BCP, PRXQ, DOT5; peroxiredoxin Q/BCP [EC:1.11.1.15]
ec:1.11.1.15	ko:K11065	tpx; thiol peroxidase, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11185	TRYP; cytosolic tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11186	MTRYP; mitochondrial tryparedoxin peroxidase, trypanosomatid typical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11187	PRDX5; peroxiredoxin 5, atypical 2-Cys peroxiredoxin [EC:1.11.1.15]
ec:1.11.1.15	ko:K11188	PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
ec:1.11.1.15	ko:K13279	PRDX1; peroxiredoxin 1 [EC:1.11.1.15]
ec:1.11.1.15	ko:K14171	AHP1; alkyl hydroperoxide reductase 1 [EC:1.11.1.15]
ec:1.1.1.132	ko:K00066	algD; GDP-mannose 6-dehydrogenase [EC:1.1.1.132]
ec:1.1.1.133	ko:K00067	rfbD, rmlD; dTDP-4-dehydrorhamnose reductase [EC:1.1.1.133]
ec:1.11.1.5	ko:K00428	E1.11.1.5; cytochrome c peroxidase [EC:1.11.1.5]
ec:1.1.1.157	ko:K00074	paaH, hbd, fadB, mmgB; 3-hydroxybutyryl-CoA dehydrogenase [EC:1.1.1.157]
ec:1.11.1.6	ko:K03781	katE, CAT, catB, srpA; catalase [EC:1.11.1.6]
ec:1.1.1.169	ko:K00077	panE, apbA; 2-dehydropantoate 2-reductase [EC:1.1.1.169]
ec:1.11.1.7	ko:K00430	E1.11.1.7; peroxidase [EC:1.11.1.7]
ec:1.11.1.7	ko:K10788	EPX, EPO; eosinophil peroxidase [EC:1.11.1.7]
ec:1.11.1.7	ko:K11188	PRDX6; peroxiredoxin 6, 1-Cys peroxiredoxin [EC:1.11.1.7 1.11.1.15 3.1.1.-]
ec:1.11.1.7	ko:K12550	LPO; lactoperoxidase [EC:1.11.1.7]
ec:1.11.1.9	ko:K00432	E1.11.1.9; glutathione peroxidase [EC:1.11.1.9]
ec:1.1.1.205	ko:K00088	guaB; IMP dehydrogenase [EC:1.1.1.205]
ec:1.1.1.215	ko:K00090	E1.1.1.215; gluconate 2-dehydrogenase [EC:1.1.1.215]
ec:1.1.1.22	ko:K00012	UGDH, ugd; UDPglucose 6-dehydrogenase [EC:1.1.1.22]
ec:1.1.1.23	ko:K00013	hisD; histidinol dehydrogenase [EC:1.1.1.23]
ec:1.1.1.23	ko:K14152	HIS4; phosphoribosyl-ATP pyrophosphohydrolase / phosphoribosyl-AMP cyclohydrolase / histidinol dehydrogenase [EC:3.6.1.31 3.5.4.19 1.1.1.23]
ec:1.1.1.25	ko:K00014	aroE; shikimate dehydrogenase [EC:1.1.1.25]
ec:1.1.1.25	ko:K13830	ARO1; pentafunctional AROM polypeptide [EC:4.2.3.4 4.2.1.10 1.1.1.25 2.7.1.71 2.5.1.19]
ec:1.1.1.25	ko:K13832	aroDE, DHQ-SDH; 3-dehydroquinate dehydratase / shikimate dehydrogenase [EC:4.2.1.10 1.1.1.25]
ec:1.1.1.26	ko:K00015	gyaR; glyoxylate reductase [EC:1.1.1.26]
...
Similarly, for the extracted enzymes, we can also list all the known reactions these enzymes are a part of. Each reaction is identified by the R number and is linked to ortholog groups of enzymes enabling integrated analysis of genomic and chemical information.
[uzi@quince-srv2 ~/annotation]$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/reaction/ec:$i | grep -Po '(?<=rn:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/reaction/rn:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	rn:R02161	Ubiquinol:ferricytochrome-c oxidoreductase; Ubiquinol + 2 Ferricytochrome c <=> Ubiquinone + 2 Ferrocytochrome c + 2 H+
ec:1.1.1.1	rn:R00228	acetaldehyde:NAD+ oxidoreductase (CoA-acetylating); Acetaldehyde + CoA + NAD+ <=> Acetyl-CoA + NADH + H+
ec:1.1.1.1	rn:R00623	primary_alcohol:NAD+ oxidoreductase; Primary alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1	rn:R00624	Secondary_alcohol:NAD+ oxidoreductase; Secondary alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1	rn:R00754	ethanol:NAD+ oxidoreductase; Ethanol + NAD+ <=> Acetaldehyde + NADH + H+
ec:1.1.1.1	rn:R01172	butanal:NAD+ oxidoreductase (CoA-acylating); Butanal + CoA + NAD+ <=> Butanoyl-CoA + NADH + H+
ec:1.1.1.1	rn:R02124	retinol:NAD+ oxidoreductase; Retinol + NAD+ <=> Retinal + NADH + H+
ec:1.1.1.1	rn:R02878	1-Octanol:NAD+ oxidoreductase; 1-Octanol + NAD+ <=> 1-Octanal + NADH + H+
ec:1.1.1.1	rn:R04805	3alpha,7alpha,26-Trihydroxy-5beta-cholestane + NAD+ <=> 3alpha,7alpha-Dihydroxy-5beta-cholestan-26-al + NADH + H+
ec:1.1.1.1	rn:R04880	3,4-dihydroxyphenylethyleneglycol:NAD+ oxidoreductase; 3,4-Dihydroxyphenylethyleneglycol + NAD+ <=> 3,4-Dihydroxymandelaldehyde + NADH + H+
ec:1.1.1.1	rn:R05233	trans-3-Chloro-2-propene-1-ol:NAD+ oxidoreductase; trans-3-Chloro-2-propene-1-ol + NAD+ <=> trans-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1	rn:R05234	cis-3-chloro-2-propene-1-ol:NAD+ oxidoreductase; cis-3-Chloro-2-propene-1-ol + NAD+ <=> cis-3-Chloroallyl aldehyde + NADH + H+
ec:1.1.1.1	rn:R06917	1-hydroxymethylnaphthalene:NAD+ oxidoreductase; 1-Hydroxymethylnaphthalene + NAD+ <=> 1-Naphthaldehyde + NADH + H+
ec:1.1.1.1	rn:R06927	(2-naphthyl)methanol:NAD+ oxidoreductase; (2-Naphthyl)methanol + NAD+ <=> 2-Naphthaldehyde + NADH + H+
ec:1.1.1.1	rn:R06983	S-(hydroxymethyl)glutathione dehydrogenase; S-(Hydroxymethyl)glutathione + NAD+ <=> S-Formylglutathione + NADH + H+
ec:1.1.1.1	rn:R07105	trichloroethanol:NAD+ oxidoreductase; Chloral hydrate + NADH + H+ <=> Trichloroethanol + NAD+ + H2O
ec:1.1.1.1	rn:R07326	alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Aldehyde + NADH + H+
ec:1.1.1.1	rn:R07327	alcohol:NAD+ oxidoreductase; Alcohol + NAD+ <=> Ketone + NADH + H+
ec:1.1.1.1	rn:R08281	alcophosphamide:NAD+ oxidoreductase; Aldophosphamide + NADH + H+ <=> Alcophosphamide + NAD+
ec:1.1.1.1	rn:R08306	2-phenyl-1,3-propanediol monocarbamate:NAD+ oxidoreductase; 2-Phenyl-1,3-propanediol monocarbamate + NAD+ <=> 3-Carbamoyl-2-phenylpropionaldehyde + NADH + H+
ec:1.1.1.1	rn:R08310	4-hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one:NAD+ oxidoreductase; 4-Hydroxy-5-phenyltetrahydro-1,3-oxazin-2-one + NAD+ <=> 5-Phenyl-1,3-oxazinane-2,4-dione + NADH + H+
ec:1.1.1.1	rn:R08557	Choline + NAD+ <=> Betaine aldehyde + NADH + H+
ec:1.1.1.1	rn:R08558	Choline + NADP+ <=> Betaine aldehyde + NADPH + H+
ec:1.1.1.100	rn:R02767	(3R)-3-Hydroxyacyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (3R)-3-Hydroxyacyl-[acyl-carrier protein] + NADP+ <=> 3-Oxoacyl-[acyl-carrier protein] + NADPH + H+
ec:1.1.1.100	rn:R04533	(3R)-3-Hydroxybutanoyl-[acyl-carrier protein]:NADP+ oxidoreductase; (3R)-3-Hydroxybutanoyl-[acyl-carrier protein] + NADP+ <=> Acetoacetyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R04534	(3R)-3-Hydroxydecanoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (3R)-3-Hydroxydecanoyl-[acyl-carrier protein] + NADP+ <=> 3-Oxodecanoyl-[acp] + NADPH
ec:1.1.1.100	rn:R04536	(3R)-3-Hydroxyoctanoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (3R)-3-Hydroxyoctanoyl-[acyl-carrier protein] + NADP+ <=> 3-Oxooctanoyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R04543	(3R)-3-Hydroxypalmitoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (3R)-3-Hydroxypalmitoyl-[acyl-carrier protein] + NADP+ <=> 3-Oxohexadecanoyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R04566	(3R)-3-Hydroxytetradecanoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (3R)-3-Hydroxytetradecanoyl-[acyl-carrier protein] + NADP+ <=> 3-Oxotetradecanoyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R04953	(3R)-3-Hydroxyhexanoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (R)-3-Hydroxyhexanoyl-[acp] + NADP+ <=> 3-Oxohexanoyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R04964	(3R)-3-Hydroxydodecanoyl-[acyl-carrier-protein]:NADP+ oxidoreductase; (R)-3-Hydroxydodecanoyl-[acp] + NADP+ <=> 3-Oxododecanoyl-[acp] + NADPH + H+
ec:1.1.1.100	rn:R07759	3-Oxostearoyl-CoA + NADPH + H+ <=> 3-Hydroxyoctadecanoyl-CoA + NADP+
ec:1.1.1.100	rn:R07763	3-Oxostearoyl-[acp] + NADPH + H+ <=> 3-Hydroxyoctadecanoyl-[acp] + NADP+
ec:1.1.1.100	rn:R10116	3-hydroxyglutaryl-[acp]-methyl-ester:NADP+ oxidoreductase; 3-Ketoglutaryl-[acp] methyl ester + NADPH + H+ <=> 3-Hydroxyglutaryl-[acp] methyl ester + NADP+
ec:1.1.1.100	rn:R10120	3-hydroxypimeloyl-[acp]-methyl-ester:NADP+ oxidoreductase; 3-Ketopimeloyl-[acp] methyl ester + NADPH + H+ <=> 3-Hydroxypimeloyl-[acp] methyl ester + NADP+
ec:1.1.1.108	rn:R02395	Carnitine:NAD+ 3-oxidoreductase; Carnitine + NAD+ <=> 3-Dehydrocarnitine + NADH + H+
ec:1.11.1.15	rn:R00602	methanol:hydrogen-peroxide oxidoreductase; Methanol + Hydrogen peroxide <=> Formaldehyde + 2 H2O
ec:1.11.1.15	rn:R00698	L-phenylalanine:oxygen oxidoreductase (decarboxylating); L-Phenylalanine + Oxygen <=> 2-Phenylacetamide + CO2
ec:1.11.1.15	rn:R02596	Donor:hydrogen-peroxide oxidoreductase; Coniferyl alcohol <=> Guaiacyl lignin
ec:1.11.1.15	rn:R03919	Donor:hydrogen-peroxide oxidoreductase; Sinapyl alcohol <=> Syringyl lignin
ec:1.11.1.15	rn:R04007	Donor:hydrogen-peroxide oxidoreductase; 4-Coumaryl alcohol <=> p-Hydroxyphenyl lignin
ec:1.11.1.15	rn:R07180	thiol-containing-reductant:hydroperoxide oxidoreductase; ROOH + 2 Thiol-containing reductant <=> Alcohol + H2O + Oxidized thiol-containing reductant
ec:1.11.1.15	rn:R07443	donor:hydrogen-peroxide oxidoreductase; 5-Hydroxyconiferyl alcohol <=> 5-Hydroxy-guaiacyl lignin
ec:1.11.1.15	rn:R08360	tryparedoxin:hydroperoxide oxidoreductase; Tryparedoxin + ROOH <=> Tryparedoxin disulfide + H2O + ROH
...
We may also be interested in knowing which other organisms contain the same enzymes:
[uzi@quince-srv2 ~/annotation]$ for i in $(awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}'  test.gbk | sort | uniq); do echo $(curl -s http://rest.kegg.jp/link/genome/ec:$i | grep -Po '(?<=genome:).*') | xargs -n 1 | xargs -I {} curl -s http://rest.kegg.jp/find/genome/genome:{} | awk -v k=$i '{print "ec:"k"\t"$0}' ; done
ec:1.10.2.2	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.10.2.2	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.1	genome:T00030	dme, DROME, 7227; Drosophila melanogaster (fruit fly)
ec:1.1.1.1	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.1	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.1	genome:T01058	ecb, HORSE, 9796; Equus caballus (horse)
ec:1.1.1.100	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.108	genome:T00035	pae, PSEAE, 208964; Pseudomonas aeruginosa PAO1
ec:1.11.1.15	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.11.1.15	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.11.1.5	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.157	genome:T00564	ckl, CLOK5, 431943; Clostridium kluyveri DSM 555
ec:1.11.1.6	genome:T00115	lpl, LACPL, 220668; Lactobacillus plantarum WCFS1
ec:1.11.1.6	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.169	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.169	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.11.1.7	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.11.1.9	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.205	genome:T00566	kpn, KLEP7, 272620; Klebsiella pneumoniae subsp. pneumoniae MGH 78578
ec:1.1.1.22	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.23	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.23	genome:T00065	stm, SALTY, 99287; Salmonella enterica subsp. enterica serovar Typhimurium LT2 (Salmonella typhimurium LT2)
ec:1.1.1.25	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.25	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.25	genome:T00566	kpn, KLEP7, 272620; Klebsiella pneumoniae subsp. pneumoniae MGH 78578
ec:1.1.1.262	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.28	genome:T00115	lpl, LACPL, 220668; Lactobacillus plantarum WCFS1
ec:1.1.1.284	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.284	genome:T00440	pde, PARDP, 318586; Paracoccus denitrificans PD1222
ec:1.1.1.284	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.284	genome:T01002	mmu, MOUSE, 10090; Mus musculus (house mouse)
ec:1.1.1.290	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.3	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.3	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.30	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.31	genome:T01009	ssc, PIG, 9823; Sus scrofa (pig)
ec:1.1.1.35	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.35	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.35	genome:T01009	ssc, PIG, 9823; Sus scrofa (pig)
ec:1.1.1.42	genome:T00011	afu, ARCFU, 224325; Archaeoglobus fulgidus DSM 4304 (VC-16)
ec:1.1.1.42	genome:T00054	sso, SULSO, 273057; Sulfolobus solfataricus P2
ec:1.1.1.42	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.42	genome:T01009	ssc, PIG, 9823; Sus scrofa (pig)
ec:1.1.1.42	genome:T01012	tcr, TRYCR, 353153; Trypanosoma cruzi CL Brener
ec:1.1.1.44	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.44	genome:T00010	bsu, BACSU, 224308; Bacillus subtilis subsp. subtilis 168
ec:1.1.1.44	genome:T01003	rno, RAT, 10116; Rattus norvegicus (Norway rat)
ec:1.1.1.47	genome:T01008	bta, BOVIN, 9913; Bos taurus (cow)
ec:1.1.1.47	genome:T01009	ssc, PIG, 9823; Sus scrofa (pig)
ec:1.1.1.49	genome:T00005	sce, YEAST, 559292; Saccharomyces cerevisiae S288c
ec:1.1.1.49	genome:T00007	eco, ECOLI, 511145; Escherichia coli K-12 MG1655
ec:1.1.1.49	genome:T00013	aae, AQUAE, 224324; Aquifex aeolicus VF5
ec:1.1.1.49	genome:T00022	tma, THEMA, 243274; Thermotoga maritima MSB8
ec:1.1.1.49	genome:T01001	hsa, HUMAN, 9606; Homo sapiens (human)
ec:1.1.1.49	genome:T01009	ssc, PIG, 9823; Sus scrofa (pig)
ec:1.1.1.49	genome:T01013	tbr, 999953; Trypanosoma brucei brucei 927/4 GUTat10.1
ec:1.1.1.49	genome:T01034	ncr, 367110; Neurospora crassa OR74A
...

Script for extracting/drawing contigs from the GBK file. These contigs contain the enzymes we are interested in

To extract/draw contigs that match a particular enzyme you are interested in, use the genbank_enzymes.py script. The usage information is as follows:
[uzi@quince-srv2 ~/annotation]$ python genbank_enzymes.py 
Usage:
	python genbank_enzymes.py -g <gbk_file> -o <output_folder> -l <list_file> [OPTIONS]

Options:
	-s (--font_size) 	Size of labels on diagrams (Default:12)
	-f (--fragments) 	Number of fragments to split the genome diagram (Default:1)
	-n (--name_length) 	Maximum length of names for output files (Default:10)

Say we are interested in extracting only those contigs that contain the two enzymes provided in IDs.txt:
[uzi@quince-srv2 ~/annotation]$ cat IDs.txt
catalase
1.1.1.22

We run the script as follows on our test.gbk file as follows:
[uzi@quince-srv2 ~/annotation]$ python genbank_enzymes.py -g test.gbk -o test -l IDs.txt -s 8 -n 9
Found catalase in NODE_4520_length_55408_cov_7.328635
	Generated test/NODE_4520.gbk, test/NODE_4520.fasta, and test/NODE_4520.pdf
Found catalase in NODE_3283_length_187584_cov_8.268562
	Generated test/NODE_3283.gbk, test/NODE_3283.fasta, and test/NODE_3283.pdf
Found catalase in NODE_4252_length_45804_cov_8.195944
	Generated test/NODE_4252.gbk, test/NODE_4252.fasta, and test/NODE_4252.pdf
Found catalase in NODE_4102_length_25349_cov_10.313819
	Generated test/NODE_4102.gbk, test/NODE_4102.fasta, and test/NODE_4102.pdf
Found catalase in NODE_1605_length_312_cov_4.407051
	Generated test/NODE_1605.gbk, test/NODE_1605.fasta, and test/NODE_1605.pdf
Found 1.1.1.22 in NODE_5958_length_795089_cov_7.501846
	Generated test/NODE_5958.gbk, test/NODE_5958.fasta, and test/NODE_5958.pdf
Found catalase in NODE_854_length_21339_cov_11.382820
	Generated test/NODE_854_.gbk, test/NODE_854_.fasta, and test/NODE_854_.pdf
Found 1.1.1.22 in NODE_5900_length_61310_cov_7.799576
	Generated test/NODE_5900.gbk, test/NODE_5900.fasta, and test/NODE_5900.pdf

Note that the script also generates annotated diagrams for contigs with enzymes on them. On these diagrams, hypothetical proteins are drawn as orange boxes, any other CDS with known information as light blue boxes, enzymes that match the EC numbers/names provided in the list as red arrows, and other enzymes as green arrows, respectively.


NODE_4252.pdf

NODE_5900.pdf

Script for extracting all known enzyme sequences from Uniprot databases

There are two types of databases available at ftp.expasy.org: uniprot_sprot.dat.gz that is manually curated and reviewed; and uniprot_trmbl.dat.gz that is automatically curated but not reviewed. These databases are the backend for what you see on www.uniprot.org. You can download these databases and can extract records from them using the extract_fasta_swissprot.py script. One of the output header formats can also give you taxonomy IDs to help you place the sequences on the phylogenetic tree. Please note that the search pattern is case-sensitive. The usage information is as follows:
[uzi@quince-srv2 ~/annotation]$ python extract_fasta_swissprot.py --help
Example usages:
        python extract_fasta_swissprot.py -i uniprot_sprot.dat.gz -p "ATP16" -f 5 -m 4 > ATP16.fasta
        python extract_fasta_swissprot.py -i uniprot_sprot.dat.gz -p "Acetyl-CoA hydrolase" > Acetyl_CoA_hydrolase.fasta
        python extract_fasta_swissprot.py -i uniprot_sprot.dat.gz -p "Candida albicans" -f 2 > Candida_albicans.fasta
Parameters:
         -i (--input)    Compressed SwissProt file
         -p (--pattern)  Search pattern
         -m (--format)   Header format
Available search types:
        -f 1    Protein Names (including Alternative Name) [DEFAULT]
        -f 2    Organism Name
        -f 3    Keywords
        -f 4    Comments
        -f 5    Gene Name
        -f 6    Accession Number
        -f 7    Organism Classification
Available header formats:
        -m 1    >Accession IDs | Organism Name [DEFAULT]
        -m 2    >Accesion IDs | Gene Name | Organism Name
        -m 3    >Entry ID
        -m 4    >Entry ID | Taxonomy IDs

For example, you can find all the Acetyl-CoA hydrolases (EC 3.1.2.1) using:
[uzi@quince-srv2 ~/annotation]$ python extract_fasta_swissprot.py -i uniprot_sprot.dat.gz -p "Acetyl-CoA hydrolase" > Acetyl_CoA_hydrolase.fasta
You can then align all the sequences using ClustalW. The alignment file generated for the above Acetyl_CoA_hydrolase.fasta is as follows:
CLUSTAL 2.1 multiple sequence alignment
P83773_Q59YK9|Candida_albicans      --------MSAILKQRVKYAPYLKKLRT-GEQCIDLF--KHGQYLGWSGF 39
Q6BKW1|Debaryomyces_hansenii        --------MSSILKQRVRYAPYLQKLRT-AEQCVELF--KHGQYLGWSGF 39
Q6C3Z9|Yarrowia_lipolytica          --------MSAVLKQRIRYKPYLDKIRT-AAQCVDLFGAKPNQYIGWSGF 41
Q6FPF3|Candida_glabrata             ------MTVSKLLKERVRYAPYLAKVKT-PEELIPLF--KNGQYLGWSGF 41
P32316_D6VPY5|Saccharomyces_ce      ------MTISNLLKQRVRYAPYLKKVKE-AHELIPLF--KNGQYLGWSGF 41
Q6CNR2|Kluyveromyces_lactis         ------MTVSRLLKDRVRYAPYLKKVKP-VEELIPLF--KDGQYIGWSGF 41
Q754Q2|Ashbya_gossypii              ------MTVSQLLKQRVRYAPYLSKVRR-AEELLPLF--KHGQYIGWSGF 41
Q9UUJ9_P78872|Schizosaccharomy      --------MPSLLSSRIRNAEFAKKIVTDPAKLVPYF--KNGDYVGWSGF 40
P15937_Q7RV87|Neurospora_crass      ---MASPIASAALRARVKRPSMLKKLCN-PEDMLQHF--PNGAYIGWSGF 44
Q54K91|Dictyostelium_discoideu      MHKIVQQGSKSLIESRIKRKSLLSKVVTDVSQLIPYF--QNGHYVSMGGF 48
					        :  *::      *:     . :  *    . *:. .**
P83773_Q59YK9|Candida_albicans      TGVGAPKVIPTTLVDHVEKNNLQG--KLGFHLFVGASAGPE-ESRWAENN 86
Q6BKW1|Debaryomyces_hansenii        TGVGAPKVVPDAIADHVEKNNLQG--KLGFNLFVGASAGPE-EGRWAEND 86
Q6C3Z9|Yarrowia_lipolytica          TGVGAPKVVPDAVSKHVEENNLQGHENWRYNLFVGASAGMEIESRWANNN 91
Q6FPF3|Candida_glabrata             TGVGTPKAVPDALIKHVEDNNLQG--KLRFNLFVGASAGPE-ENKWAEHD 88
P32316_D6VPY5|Saccharomyces_ce      TGVGTPKAVPEALIDHVEKNNLQG--KLRFNLFVGASAGPE-ENRWAEHD 88
Q6CNR2|Kluyveromyces_lactis         TGVGAPKAVPEALIKHVEENNLQG--KLRFNLFVGASAGPE-ECKWAEHD 88
Q754Q2|Ashbya_gossypii              TGVGAPKVIPTALADHVEKNGLQG--QLAFNLFVGASAGPE-ENRWADLD 88
Q9UUJ9_P78872|Schizosaccharomy      TGVGYPKMIPNALARHVEENNLQG--KLRFKLFVGASGGADTECKWAELN 88
P15937_Q7RV87|Neurospora_crass      TGVGYPKKVPTMLADHVEKNGLQG--QLKYSLFVGASAGAETENRWAALD 92
Q54K91|Dictyostelium_discoideu      AGTGYPKVVPIALADHVEKNGLQG--KLKMNLFVGASVGPETEDRWAMLD 96
As another example, you can find all the proteins for the organism Candida albicans as:
[uzi@quince-srv2 ~/annotation]$ python extract_fasta_swissprot.py -i uniprot_sprot.dat.gz -p "Candida albicans" -f 2 > Candida_albicans.fasta
The FASTA file will have the following records:
>O42766:Q5AI38|Candida_albicans
MPASREDSVYLAKLAEQAERYEEMVENMKAVASSGQELSVEERNLLSVAYKNVIGARRAS
WRIVSSIEQKEEAKGNESQVALIRDYRAKIEAELSKICEDILSVLSDHLITSAQTGESKV
FYYKMKGDYHRYLAEFAIAEKRKEAADLSLEAYKAASDVAVTELPPTHPIRLGLALNFSV
FYYEILNSPDRACHLAKQAFDDAVADLETLSEDSYKDSTLIMQLLRDNLTLWTDLSEAPA
ATEEQQQSSQAPAAQPTEGKADQE
>Q59Z17:Q59Z79|Candida_albicans
MVKKVLLINGPNLNLLGTREPEKYGTTSLSDIEQAAIEQAKLKNNDSEVLAFQSNTEGFI
IDRIHEAKRQGVGFVVINAGAYTHTSVGIRDALLGTAIPFIEVHITNVHQREPFRHQSYL
SDKAVAVICGLGVYGYTAAIEYALNY
>Q59K86|Candida_albicans
MVLGQPINIIKWIEENGDLLKPPVNNFCLHRGGFTIMIVGGPNERSDYHINQTPEYFYQF
KGTMCLKVVDDGEFKDIFINEGDSFLLPPNVPHNPCRYENTIGIVVEQDRPAGVNDKVRW
YCQKCQTVIHEVEFYLTDLGTQIKEAIVKFDADLDARTCKNCGTVNSSRRD
>O13287|Candida_albicans
MKNFNALSRLSILSKQLSFNNTNSSIARGDIGLIGLAVMGQNLILNMADHGYTVVAYNRT
TAKVDRFLENEAKGKSILGAHSIKELVDQLKRPRRIMLLVKAGAPVDEFINQLLPYLEEG
DIIIDGGNSHFPDSNRRYEELAKKGILFVGSGVSGGEEGARTGPSLMPGGNEKAWPHIKE
IFQDVAAKSDGEPCCDWVGDAGAGHYVKMVHNGIEYGDMQLICEAYDLMKRVGKFEDKEI
GDVFATWNKGVLDSFLIEITRDILYYNDPTDGKPLVEKILDTAGQKGTGKWTAVNALDLG
IPVTLIGEAVFSRCLSAMKAERVEASKALKGPQVTGESPITDKKQFIDDLEQALYASKII
SYTQGFMLMNQAAKDYGWKLNNAGIALMWRGGCIIRSVFLAEITAAYRKKPDLENLLLYP
FFNDAITKAQSGWRASVGKAIQYGIPTPAFSTALAFYDGLRSERLPANLLQAQRDYFGAH
TFKVLPGQENELLKKDEWIHINWTGRGGDVSSTTYDA
...

Script for finding motifs within enzyme sequences

The site scansite.mit.edu provides a service that searches for motifs within proteins that are likely to be phosphorylated by specific protein kinases or bind to domains such as SH2 domains, 14-3-3 domains or PDZ domains. The extract_scansite_motifs_fasta.pl script searches a given FASTA file (comprising protein sequences) against the scansite server and outputs a tab-delimited file containing all the significant motifs. For example, you can run the script on Acetyl_CoA_hydrolase.fasta generated previously as:
[uzi@quince-srv2 ~/annotation]$ perl extract_scansite_motifs_fasta.pl -f Acetyl_coa_hydrolase.fasta
ID      			Kinase_name     Percentile      Position_in_protein     Score   Sequence        Phosphorylated_residue  Zscore
Q754Q2|Ashbya_gossypii  		Amphi_SH3       0.102   195     0.4737  GLHDIDMPVLPPHRV P195    -4.262
Q754Q2|Ashbya_gossypii  		CapC_SH3        0.076   222     0.5588  LDAVPVDPARVVALV P222    -4.497
P83773:Q59YK9|Candida_albicans  	PKC_epsilon     0.054   492     0.3526  FYATKKKTLHEPHIL T492    -3.338
P83773:Q59YK9|Candida_albicans  	ErkDD   	0.110   216     0.5485  FKIGKTAIPVDPEKV I216    -3.583
Q6FPF3|Candida_glabrata 		Amphi_SH3       0.102   195     0.4737  GVHDIDMPVNPPHRV P195    -4.262
Q6FPF3|Candida_glabrata 		PDK1_Bind       0.057   340     0.3495  ERVFKNWDFYKSRLC D340    -5.194
Q6BKW1|Debaryomyces_hansenii    	Grb2_SH2        0.121   333     0.4046  EAGFDKFYENWDTYS Y333    -4.345
Q6BKW1|Debaryomyces_hansenii    	ErkDD   	0.068   215     0.5206  YRTGLTAIPVDPSKV I215    -3.777
Q54K91|Dictyostelium_discoideum 	PKA_Kin 	0.055   21      0.2758  ESRIKRKSLLSKVVT S21     -4.455
P15937:Q7RV87|Neurospora_crassa 	PKC_epsilon     0.139   410     0.3816  FLRNSKYSIMHTPST S410    -3.066
P15937:Q7RV87|Neurospora_crassa 	PKC_zeta        0.036   18      0.4109  RARVKRPSMLKKLCN S18     -3.778
P15937:Q7RV87|Neurospora_crassa 	PKC_mu  	0.165   519     0.4187  KALVEEGSMAKVKF* S519    -3.263
P15937:Q7RV87|Neurospora_crassa 	ErkDD   	0.086   221     0.5357  DRIGTTSVPVDPEKV V221    -3.672
Q9UUJ9:P78872|Schizosaccharomyces_pombe PDK1_Bind       0.040   339     0.2255  EKFYSNYDYYKERIL D339    -5.843
Q6C3Z9|Yarrowia_lipolytica      	ErkDD   	0.117   144     0.5519  KKENSLDIAVIEATA I144    -3.559

Scripts for searching CDS regions extracted from PROKKA against NCBI's CDD databases

PROKKA is a software tool for the rapid annotation of prokaryotic genomes. You can use it to generate GFF3, GBK and SQN files for metagenomic whole-genome shotgun contigs assembled using any metagenomic assembly software. After identifying CDS regions using PROKKA, you can search the protein sequences against NCBI's Conserved Domain Databases (CDD). CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. While searching against these databases, refer to standard operating procedures for parameter selection. You can use the PROKKA_RPSBLAST.sh script that has the following syntax (-d switch to specify the database):
[uzi@quince-srv2 ~/annotation]$ ./PROKKA_RPSBLAST.sh
Usage:
    bash PROKKA_RPSBLAST.sh -f <fasta_file.faa> [options]

Options:
    -p Parallelize using GNU Parallel flag
    -c Number of cores to use (Default: 10)
    -d rpsblast db [Cdd,Cog,Kog,Pfam,Prk,Smart,Tigr] to use (Default: Cog)
    -t Number of threads to use (Default: 1)
    -r Number of reference matches (Default: 500)
    -e Expectation value (E) threshold for saving hits (Default: 0.00001)

For different databases, you can run the script as
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Cdd
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Cog
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Kog
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Pfam
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Prk
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Smart
./PROKKA_RPSBLAST.sh -p -c 20 -f PROKKA_XXXXXXXX/PROKKA_XXXXXXXX.faa -d Tigr
Cdd has all databases combined Cog/Kog/Pfam/Prk/Smart/Tigr i.e. everything. The PROKKA_CDD.py script can then be used with the GFF file generated from PROKKA:
[uzi@quince-srv2 ~/annotation]$ ./PROKKA_CDD.py
Usage:
        ./PROKKA_CDD.py -g <gfffile> -b <blastoutfile>

Optional parameters:
        -s (--scovs-threshold)          subject coverage threshold (Default:60)
        -p (--pident-threshold)         pident threshold (Default:0)

Example usage:

        Step 1: Run PROKKA_XXXXXXXX.faa with rpsblast against the [Cdd/Cog/Kog/Pfam/Prk/Smart/Tigr] database
        with following format:
                        rpsblast -query PROKKA_XXXXXXXX.faa -db [Cdd/Cog/Kog/Pfam/Prk/Smart/Tigr] -evalue 0.00001
                        -outfmt "6 qseqid sseqid evalue pident score qstart qend
                        sstart send length slen" -out blast_output.out

        Step 2: Run this script to generate CDD anotations:
                        ./PROKKA_CDD.py -g PROKKA_XXXXXXXX.gff -b blast_output.out
                         > annotations.tsv

        The output is a tab-delimited file with following fields:
                #Query                  with concatenated PROKKA CDS Ids using _ as delimeter
                Hit                     from CDD database using NCBI's eutils
                E-value                 from rpsblast
                Identity                from rpsblast
                Score                   from rpsblast
                Query-start             from rpsblast
                Query-end               from rpsblast
                Hit-start               from rpsblast
                Hit-end                 from rpsblast
                Hit-length              from rpsblast
                Abstract                from CDD database using NCBI's eutils
                Title                   from CDD database using NCBI's eutils
                Comments                CDS to contig mapping (from GFF file)

Refer to rpsblast tutorial: http://www2.warwick.ac.uk/fac/sci/moac/people/students/peter_cock/python/rpsblast/

Here is the output generated on an example dataset:
[uzi@quince-srv2 ~/annotation]$ ./PROKKA_CDD.py -g ../PROKKA_01092014/PROKKA_01092014.gff -b PROKKA_01092014.out
#Query  Hit     E-value Identity        Score   Query-start     Query-end       Hit-start       Hit-end Hit-length      Abstract        TitleComments
contig-9000000_07199    cd01189 1e-40   30.85   366     177     354     3       185     188     phiLC3 phage and phage-related integrases, site-specific recombinases, DNA breaking-rejoining enzymes, C-terminal catalytic domain. This CD includes various bacterial (mainly gram positive) and phage integrases, including those similar to Lactococcus phage phiLC3, TPW22, Tuc2009, BK5-T, A2, bIL285, bIL286, bIL311, ul36 and phi g1e; Staphylococcus aureus phage phi13 and phi42; Oenococcus oeni phage fOg44; Streptococcus thermophilus phage O1205 and Sfi21; and Streptococcus pyogenes phage T12 and T270.        INT_phiLC3_C    [138203,139355](-)
contig-9000000_07199    pfam00589       7e-25   29.19   250     177     356     3       167     185     Members of this family cleave DNA substrates by a series of staggered cuts, during which the protein becomes covalently linked to the DNA through a catalytic tyrosine residue at the carboxy end of the alignment. The catalytic site residues in CRE recombinase are Arg-173, His-289, Arg-292 and Tyr-324.   Phage_integrase [138203,139355](-)
contig-9000000_07199    cd00801 8e-25   23.42   259     41      364     57      351     333     Bacteriophage P4 integrase. P4-like integrases are found in temperate bacteriophages, integrative plasmids, pathogenicity and symbiosis islands, and other mobile genetic elements. They share the same fold in their catalytic domain and the overall reaction mechanism with the superfamily of DNA breaking-rejoining enzymes. The P4 integrase mediates integrative and excisive site-specific recombination between two sites, called attachment sites, located on the phage genome and the bacterial chromosome. The phage attachment site is often found adjacent to the integrase gene, while the host attachment sites are typically situated near tRNA genes. INT_P4  [138203,139355](-)
...

KNOWN BUG WITH PROKKA AND RESOLUTION: PROKKA (verson 1.7.2) doesn't seem to handle longer contig names (from assembly software) in the FASTA file and messes up the GBK file after annotation. Before using PROKKA, you can change the FASTA headers using the following one-liner:
[uzi@quince-srv2 ~/annotation]$ awk '/>/{gsub("^>","",$0);split($0,a,"_");$0=">C"a[2]}1' test.fasta > test_mname.fasta
[uzi@quince-srv2 ~/annotation]$ grep -i ">" test.fasta | head
>NODE_3_length_194_cov_3.484536
>NODE_4_length_167_cov_3.233533
>NODE_5_length_462_cov_13.017316
>NODE_6_length_181_cov_2.651934
>NODE_7_length_171_cov_3.040936
>NODE_8_length_395_cov_3.055696
>NODE_9_length_244_cov_2.434426
>NODE_10_length_186_cov_2.951613
>NODE_11_length_212_cov_3.089623
>NODE_12_length_182_cov_3.307692
[uzi@quince-srv2 ~/annotation]$ grep -i ">" test_mname.fasta | head
>C3
>C4
>C5
>C6
>C7
>C8
>C9
>C10
>C11
>C12

Even if you have errors in your GBK file, you can still fix them. First of all, dates should be in the LOCUS line (grep -i "^LOCUS" on GBK file). In the problematic GBK file, some dates go down to the next line. for example,
LOCUS       NODE_100_length_1535_cov_6.6912051605 bp   DNA linear
06-FEB-2014
DEFINITION  Genus species strain strain.
Therefore, first step is to move all those lines up to the LOCUS line. You can save the file as test.gbk, edit it in vim, and use the following command (use the date that appears in your GBK file):
:%s/\n06-FEB-2014/       06-FEB-2014/g
Secondly, notice that the length field is missing i.e. it says bp but where is the length? To resolve this, shorten the names and extract the lengths from contig names themselves i.e. by using the following command, "NODE_100_length_1535_cov_6.6912051605 bp" will change to "NODE_100 1535 bp" using this command:
[uzi@quince-srv2 ~/annotation]$ cat test.gbk | sed -e 's/\(NODE_[0-9]*\)_length_\([0-9]*\)_.* bp/\1\t\2 bp/g' > test2.gbk
If you load the GBK file up in Biopython, it will still throw warning messages. Why? because lengths extracted from the names are different than the actual lengths of the sequences!
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/GenBank/__init__.py:1191: BiopythonParserWarning: Expected sequence length 16779, found 16849 (NODE_97).
  BiopythonParserWarning)
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/GenBank/__init__.py:1191: BiopythonParserWarning: Expected sequence length 5819, found 5889 (NODE_98).
  BiopythonParserWarning)
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/GenBank/__init__.py:1191: BiopythonParserWarning: Expected sequence length 197, found 267 (NODE_99).
  BiopythonParserWarning)
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/GenBank/__init__.py:1191: BiopythonParserWarning: Expected sequence length 28909, found 28979 (NODE_9)
Don't worry, the file will still be loaded in Biopython with corrected lengths. After loading in Biopython, write the records to a different file and it will fix the lengths error:
from Bio import SeqIO
 
input_handle = open("test2.gbk", "rU")
output_handle = open("test3.gbk", "w")
 
sequences = SeqIO.parse(input_handle, "genbank")
count = SeqIO.write(sequences, output_handle, "genbank")
 
output_handle.close()
input_handle.close()	
For example, the original test2.gbk contains:
LOCUS       NODE_100    1535 bp   DNA linear       06-FEB-2014
DEFINITION  Genus species strain strain.
ACCESSION
VERSION
KEYWORDS    .
SOURCE      Genus species
  ORGANISM  Genus species
            Unclassified.
COMMENT     Annotated using prokka 1.7.2 from http://www.vicbioinformatics.com.
FEATURES             Location/Qualifiers
     source          1..1605
                     /organism="Genus species"
                     /mol_type="genomic DNA"
                     /strain="strain"
ORIGIN
        1 aaagtaaatt agggatattc agtaaatagt ataaaattag ccaactaatt catacacaaa
       61 atgttgcgaa ttagttggct tttttgtatc tttaggtatt cccccgcaaa taaggtttgg
      121 aatccaaaac gggggaaatt ccccaacaaa gatatggcta agatacaaaa tatttcggaa
      181 attcacccta ctttgggctt tacagaattt gatattatag aaaaatatcg caagagtttt
      241 catgagagtg agcttggcag gcttcattcg gtgtttccgt ttgagcgtat ggcaaagacc
      301 atgggcctgt cggaacagcg actgggacgc aggaacatct tcagtccttc ggcgaagatc
      361 gcccttatgg tcctgaaggc gtacaccgga ttctctgaca ggcagctggt ggaacatctt
      421 aacgggaaca tacactacca gatgttctgc gggattatga taaatccgtc ctttcccata
      481 accaactaca agatagtcag tgccatccgc aatgagatag cgtcccgtct tgatgttgat
      541 tccctccagg aagtgctggc ttcacactgg aaaccctatc ttgacagcct tcacgtctgt
      601 atgaccgatg ccacatgcta tgagagccat atgcgttttc ctacggatat gaaactcctt
      661 tgggaaagtc tggaatggct ctacaggcaa atctgccttc attgcagaga tttgggtata
      721 aggcgtccgc gcaacaagta tgcggatgtg gcgaagtcct atctgtccta ctgcaagaag
      781 agaaagagga aggcttcaag gacaaggatg ctcaagcgtc gtatgatcag actccttgaa
      841 aagctcctca tacagaggga tgaaatccat agagaacatg gaacctcact ccgatatacc
      901 caggattacc agaagcgtct ttccatcatc agaaaggttc ttgtacagga aaaggagttg
      961 tttgaaggca ggaaggtcag ggaccgcatc gtcagcattg accgtcatta tgtacgtccc
     1021 attgtaagag gcaaggaaac annnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1081 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1141 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1201 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnncaaaag aaagggccac ccgacttgaa
     1261 ggcagcttcg gcacacagaa gcagcattac tcactctcaa ggataaaggc acggaacagg
     1321 aagacggaaa tcctatggat tttctttgga atacatacgg caaatgccat actgatgata
     1381 gataaaatca ggaaccgggc ggataaagcg gcatgacata aatttaccga cagaaccaga
     1441 aggggtcatc agacctcttc cgaagcgtcg tgtctaacaa gataggatta tgtgaaaaaa
     1501 cacagaaaaa gggcaataat aatggcatat aaagtgatta tactctatca actttatatg
     1561 ccatcttact ttttagggac atttactgaa tatccctatt taaat
//
where as test3.gbk now contains the correct lengths:
LOCUS       NODE_100                1605 bp    DNA              UNK 06-FEB-2014
DEFINITION  Genus species strain strain.
ACCESSION   NODE_100
VERSION     NODE_100
KEYWORDS    .
SOURCE      Genus species
  ORGANISM  Genus species Unclassified.
            .
COMMENT     Annotated using prokka 1.7.2 from http://www.vicbioinformatics.com.
FEATURES             Location/Qualifiers
     source          1..1605
                     /strain="strain"
                     /mol_type="genomic DNA"
                     /organism="Genus species"
ORIGIN
        1 aaagtaaatt agggatattc agtaaatagt ataaaattag ccaactaatt catacacaaa
       61 atgttgcgaa ttagttggct tttttgtatc tttaggtatt cccccgcaaa taaggtttgg
      121 aatccaaaac gggggaaatt ccccaacaaa gatatggcta agatacaaaa tatttcggaa
      181 attcacccta ctttgggctt tacagaattt gatattatag aaaaatatcg caagagtttt
      241 catgagagtg agcttggcag gcttcattcg gtgtttccgt ttgagcgtat ggcaaagacc
      301 atgggcctgt cggaacagcg actgggacgc aggaacatct tcagtccttc ggcgaagatc
      361 gcccttatgg tcctgaaggc gtacaccgga ttctctgaca ggcagctggt ggaacatctt
      421 aacgggaaca tacactacca gatgttctgc gggattatga taaatccgtc ctttcccata
      481 accaactaca agatagtcag tgccatccgc aatgagatag cgtcccgtct tgatgttgat
      541 tccctccagg aagtgctggc ttcacactgg aaaccctatc ttgacagcct tcacgtctgt
      601 atgaccgatg ccacatgcta tgagagccat atgcgttttc ctacggatat gaaactcctt
      661 tgggaaagtc tggaatggct ctacaggcaa atctgccttc attgcagaga tttgggtata
      721 aggcgtccgc gcaacaagta tgcggatgtg gcgaagtcct atctgtccta ctgcaagaag
      781 agaaagagga aggcttcaag gacaaggatg ctcaagcgtc gtatgatcag actccttgaa
      841 aagctcctca tacagaggga tgaaatccat agagaacatg gaacctcact ccgatatacc
      901 caggattacc agaagcgtct ttccatcatc agaaaggttc ttgtacagga aaaggagttg
      961 tttgaaggca ggaaggtcag ggaccgcatc gtcagcattg accgtcatta tgtacgtccc
     1021 attgtaagag gcaaggaaac annnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1081 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1141 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn
     1201 nnnnnnnnnn nnnnnnnnnn nnnnnnnnnn nnnncaaaag aaagggccac ccgacttgaa
     1261 ggcagcttcg gcacacagaa gcagcattac tcactctcaa ggataaaggc acggaacagg
     1321 aagacggaaa tcctatggat tttctttgga atacatacgg caaatgccat actgatgata
     1381 gataaaatca ggaaccgggc ggataaagcg gcatgacata aatttaccga cagaaccaga
     1441 aggggtcatc agacctcttc cgaagcgtcg tgtctaacaa gataggatta tgtgaaaaaa
     1501 cacagaaaaa gggcaataat aatggcatat aaagtgatta tactctatca actttatatg
     1561 ccatcttact ttttagggac atttactgaa tatccctatt taaat
//

Last Updated by Dr Umer Zeeshan Ijaz on 24/03/2014.