============================================== Umer Zeeshan Ijaz http://userweb.eng.gla.ac.uk/umer.ijaz Biopython Tutorial + File Formats Version 0.1 ============================================== CLUSTALW file format example: ============================= CLUSTAL W (1.81) multiple sequence alignment CYS1_DICDI -----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ------------SQ ALEU_HORVU MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTLESAVLGALGRTRHALR CATH_HUMAN ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSLEK------------FH * :*.. : :. . . *. . CYS1_DICDI FLEFQDKFNKKY-SHEEYLERFEIFKSNLGKIEELNLIAINHKADTKFGVNKFADLSSDE ALEU_HORVU FARFAVRYGKSYESAAEVRRRFRIFSESLEEVRSTN----RKGLPYRLGINRFSDMSWEE CATH_HUMAN FKSWMSKHRKTY-STEEYHHRLQTFASNWRKINAHN----NGNHTFKMALNQFSDMSFAE * : .. *.* * * *: * .. :: * . .:.:*.*:*:* * CYS1_DICDI FKNYYLNNKEAIFTDDLPVADYLDDEFINSIPTAFDWRTRG-AVTPVKNQGQCGSCWSFS ALEU_HORVU FQATRL-GAAQTCSATLAGNHLMRDA--AALPETKDWREDG-IVSPVKNQAHCGSCWTFS CATH_HUMAN IKHKYLWSEPQNCSAT--KSNYLRGT--GPYPPSVDWRKKGNFVSPVKNQGACGSCWTFS :: * . : : . . * : *** * *:*****. *****:** CYS1_DICDI TTGNVEGQHFISQNKLVSLSEQNLVDCDHECMEYEGEEACDEGCNGGLQPNAYNYIIKNG ALEU_HORVU TTGALEAAYTQATGKNISLSEQQLVDCAGGFNNF--------GCNGGLPSQAFEYIKYNG CATH_HUMAN TTGALESAIAIATGKMLSLAEQQLVDCAQDFNNY--------GCQGGLPSQAFEYILYNK *** :*. : .* :**:**:**** :: **:*** .:*::** * CYS1_DICDI GIQTESSYPYTAETGTQCNFNSANIGAKISNFTMIP-KNETVMAGYIVSTGPLAIAADAV ALEU_HORVU GIDTEESYPYKGVNGV-CHYKAENAAVQVLDSVNITLNAEDELKNAVGLVRPVSVAFQVI CATH_HUMAN GIMGEDTYPYQGKDGY-CKFQPGKAIGFVKDVANITIYDEEAMVEAVALYNPVSFAFEVT ** *.:*** . * *:::. : : : . *. * : : *::.* :. CYS1_DICDI E-WQFYIGGVF-DIPCN--PNSLDHGILIVGYSAKNTIFRKNMPYWIVKNSWGADWGEQG ALEU_HORVU DGFRQYKSGVYTSDHCGTTPDDVNHAVLAVGYGVENGV-----PYWLIKNSWGADWGDNG CATH_HUMAN QDFMMYRTGIYSSTSCHKTPDKVNHAVLAVGYGEKNGI-----PYWIVKNSWGPQWGMNG : : * *:: . * *:.::*.:* ***. :* : ***::*****.:** :* CYS1_DICDI YIYLRRGKNTCGVSNFVSTSII-- ALEU_HORVU YFKMEMGKNMCAIATCASYPVVAA CATH_HUMAN YFLIERGKNMCGLAACASYPIPLV *: : *** *.:: .* .: Stolkholm file format: ====================== # STOCKHOLM 1.0 #=GF ID CBS #=GF AC PF00571 #=GF DE CBS domain #=GF AU Bateman A #=GF CC CBS domains are small intracellular modules mostly found #=GF CC in 2 or four copies within a protein. #=GF SQ 5 #=GS O31698/18-71 AC O31698 #=GS O83071/192-246 AC O83071 #=GS O83071/259-312 AC O83071 #=GS O31698/88-139 AC O31698 #=GS O31698/88-139 OS Bacillus subtilis O83071/192-246 MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS #=GR O83071/192-246 SA 9998877564535242525515252536463774777 O83071/259-312 MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY #=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE O31698/18-71 MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS #=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH O31698/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE #=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH #=GC SS_cons CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH O31699/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE #=GR O31699/88-139 AS ________________*____________________ #=GR O31699/88-139 IN ____________1____________2______0____ // Blast XML file format example: ============================== blastn BLASTN 2.2.12 [Aug-07-2005] Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. nr

gi|1348916|gb|G26684.1|G26684

human STS STS_D11570, sequence tagged site

285

-3

gi|1348916|gb|G26684.1|G26684

human STS STS_D11570, sequence tagged site

285

1 gi|9950606|gb|AE004854.1| Pseudomonas aeruginosa PAO1, section 415 of 529 of the complete genome AE004854 11884 1

38.1576

19 1.0598

6012

6030

19 19 0

CAGGCCAGCGACTTCTGGG CAGGCCAGCGACTTCTGGG ||||||||||||||||||| 2 gi|15073988|emb|AL591786.1|SME591786 Sinorhizobium meliloti 1021 complete chromosome; segment 5/12 AL591786 299350 1

36.1753

18 4.18768

204

224

83648

83628

-1

20 20 0

TGAAAGGAAATNAAAATGGAA TGAAAGGAAATCAAAATGGAA ||||||||||| |||||||||

371021

1233631384

0.710603 1.37406 1.30725 EMBL flat file format: ====================== ID SC10H5 standard; DNA; PRO; 4870 BP. XX AC AL031232; XX DE Streptomyces coelicolor cosmid 10H5. XX KW integral membrane protein. XX OS Streptomyces coelicolor OC Eubacteria; Firmicutes; Actinomycetes; Streptomycetes; OC Streptomycetaceae; Streptomyces. XX RN [1] RP 1-4870 RA Oliver K., Harris D.; RT ; RL Unpublished. XX RN [2] RP 1-4870 RA Parkhill J., Barrell B.G., Rajandream M.A.; RT ; RL Submitted (10-AUG-1998) to the EMBL/GenBank/DDBJ databases. RL Streptomyces coelicolor sequencing project, RL Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA RL E-mail: barrell@sanger.ac.uk RL Cosmids supplied by Prof. David A. Hopwood, [3] RL John Innes Centre, Norwich Research Park, Colney, RL Norwich, Norfolk NR4 7UH, UK. XX RN [3] RP 1-4870 RA Redenbach M., Kieser H.M., Denapaite D., Eichner A., RA Cullum J., Kinashi H., Hopwood D.A.; RT "A set of ordered cosmids and a detailed genetic and physical RT map for the 8 Mb Streptomyces coelicolor A3(2) chromosome."; RL Mol. Microbiol. 21(1):77-96(1996). XX CC Notes: CC CC Streptomyces coelicolor sequencing at The Sanger Centre is funded CC by the BBSRC. CC CC Details of S. coelicolor sequencing at the Sanger Centre CC are available on the World Wide Web. CC (URL; http://www.sanger.ac.uk/Projects/S_coelicolor/) CC CC CDS are numbered using the following system eg SC7B7.01c. CC SC (S. coelicolor), 7B7 (cosmid name), .01 (first CDS), CC c (complementary strand). CC CC The more significant matches with motifs in the PROSITE CC database are also included but some of these may be fortuitous. CC CC The length in codons is given for each CDS. CC CC Usually the highest scoring match found by fasta -o is given for CC CDS which show significant similarity to other CDS in the database. CC The position of possible ribosome binding site sequences are CC given where these have been used to deduce the initiation codon. CC CC Gene prediction is based on positional base preference in codons CC using a specially developed Hidden Markov Model (Krogh et al., CC Nucleic Acids Research, 22(22):4768-4778(1994)) and the FramePlot CC program of Bibb et al., Gene 30:157-66(1984) as implemented at CC http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. CAUTION: We may CC not have predicted the correct initiation codon. Where possible CC we choose an initiation codon (atg, gtg, ttg or (att)) which is CC preceded by an upstream ribosome binding site sequence (optimally CC 5-13bp before the initiation codon). If this cannot be identified CC we choose the most upstream initiation codon. CC CC IMPORTANT: This sequence MAY NOT be the entire insert of CC the sequenced clone. It may be shorter because we only CC sequence overlapping sections once, or longer, because we CC arrange for a small overlap between neighbouring submissions. CC CC Cosmid 10H5 lies to the right of 3A7 on the AseI-B genomic restriction CC fragment. XX FH Key Location/Qualifiers FH FT source 1..4870 FT /organism="Streptomyces coelicolor" FT /strain="A3(2)" FT /clone="cosmid 10H5" FT CDS complement(<1..327) FT /note="SC10H5.01c, unknown, partial CDS, len >109 aa; FT possible integral membrane protein" FT /gene="SC10H5.01c" FT /product="hypothetical protein SC10H5.01c" FT CDS complement(350..805) FT /note="SC10H5.02c, probable integral membrane protein, len: FT 151 aa; similar to S. coelicolor hypothetical protein FT TR:O54194 (EMBL:AL021411) SC7H1.35 (155 aa), fasta scores; FT opt: 431 z-score: 749.8 E(): 0, 53.5% identity in 114 aa FT overlap." FT /product="putative integral membrane protein" FT /gene="SC10H5.02c" FT RBS complement(812..815) FT /note="possible RBS upstream of SC10H5.02c" FT CDS complement(837..1301) FT /note="SC10H5.03c, probable integral membrane protein, len: FT 154 aa" FT /product="putative integral membrane protein" FT /gene="SC10H5.03c" FT RBS complement(1308..1312) FT /note="possible RBS upstream of SC10H5.03c" FT CDS complement(1427..1735) FT /note="SC10H5.04c, unknown, len: 103 aa; possible membrane" FT /gene="SC10H5.04c" FT /product="hypothetical protein SC10H5.04c" FT RBS complement(1738..1741) FT /note="possible RBS upstream of SC10H5.05c" FT misc_feature 1800^1801 FT /note="Zero-length feature added to test Bioperl parsing" FT CDS 1933..2022 FT /note="SC10H5.05, questionable ORF, len: 29 aa" FT /gene="SC10H5.05" FT /product="hypothetical protein SC10H5.05" FT CDS 2019..2642 FT /note="SC10H5.06, probable membrane protein, len: 207 aa; FT similar to S. coelicolor TR:O54192 SC7H1.33c (191 aa), FT fasta scores; opt: 312 z-score: 355.2 E(): 1.6e-12, 36.8% FT identity in 182 aa overlap" FT /product="putative membrane protein" FT /gene="SC10H5.06" FT RBS 2627..2631 FT /note="possible RBS upstream of SC10H5.07" FT CDS 2639..4048 FT /note="SC10H5.07, unknown, len: 469 aa" FT /gene="SC10H5.07" FT /product="hypothetical protein SC10H5.07" FT CDS complement(4100..4297) FT /note="SC10H5.08c, unknown, len: 65 aa" FT /gene="SC10H5.08c" FT /product="hypothetical protein SC10H5.08c" FT RBS complement(4314..4319) FT /note="possible RBS upstream of SC10H5.08c" FT CDS complement(4439..>4870) FT /note="SC10H5.09c, probable integral membrane protein, FT partial CDS len: >143 aa; some similarity in C-terminus to FT S. coelicolor hypothetical protein TR:O54106 FT (EMBL:AL021529) SC10A5.15 (114 aa), fasta scores; opt: 145 FT z-score: 233.8 E(): 9.2e-06, 33.3% identity in 81 aa FT overlap. Overlaps and extends SC3A7.01c" FT /product="putative integral membrane protein" FT /gene="SC10H5.09c" FT misc_feature 4769..4870 FT /note="overlap with cosmid 3A7 from 1 to 102" XX SQ Sequence 4870 BP; 769 A; 1717 C; 1693 G; 691 T; 0 other; gatcagtaga cccagcgaca gcagggcggg gcccagcagg ccggccgtgg cgtagagcgc 60 gaggacggcg accggcgtgg ccaccgacag gatggctgcg gcgacgcgga cgacaccgga 120 gtgtgccagg gcccaccaca cgccgatggc cgcgagcgcg agtcccgcgc tgccgaacag 180 ggcccacagc acactgcgca gaccggcggc cacgagtggc gccaggacgg tgcccagcag 240 gagcagcagg gtgacgtggg cgcgcgctgc actgtggccg ccccgtccgc ccgacgcgcg 300 cggctcgtca tctcgcggtc ccaccaccgg tcggccccat tactcgtcct caaccctgtg 360 gcgactgacg ttccccggac aggtcgtacc gattgccgcc acgccccacc acgcacaggg 420 cccagacgac gaagcctgac atggtgatca tgacgacgga ccacaccggg tagtacggca 480 gcgagaggaa gttggcgatg atcaccagcc cggcgatggc gaccccggtg acacgtgccc 540 acatcgccgt tttgagcagc ccggcgctga cgaccatggc gagcgcgccg agcgcgagat 600 ggatccaccc ccacccggtg agatcgaact ggaaaacgta gttgggcgtg gtgacgaaga 660 cgtcgtcctc ggcgatggcc atgatgcccc ggaagaggct gagcagcccg gcgaggaaga 720 gcatcaccgc cgcgaaggcg gtaaggcccg tcgcccattc ctgcctcgcg gtgtgtgccg 780 ggtggtgggt atgtgacgtg gtcatctcgg acctcgtttc gtggaatgcg gatgcttcag 840 cgagcggagg cgccggtgcc cgccgcgccc gtgtgccctg ccgggccgtg accggacagg 900 accaattcct tcgccttgcg gaactcctcg tccgtgatgg caccccggtc tcggatctcg 960 gagagccggg ccagctcgtc gacgctgctg gacccgccgc ccacggtctt cctgatgtag 1020 gcgtcgaact cctcctgctg agcccgtgcc cgcgttgtct cccggctgcc catgttcttg 1080 ccgcgagcga tcacgtagac gaaaacgccc aggaagggca ggaggatgca gaacaccaac 1140 cagccggcct tcgcccagcc actcagtccg tcgtcccgga agatgtcggt gacgacgcgg 1200 aagagcagga cgaaccacat gatccacagg aagatcatca gcatcgtcca gaaggcaccc 1260 agcagtgggt agtcgtacgc caggtaggtc tgtgcactca tgtccgtcct ccgtcctccg 1320 gggcgcggcc cggcggccct cgttccgtac tgacatcagg gtggtcacgg gtcccaccgg 1380 tcggcatcac ccggcacggg tgagtggggc gccgaggccg tcgtggtcag gcccgggaca 1440 ccggtgtgac cctggtggaa ggacgcgtcc cgtggggcac gcaccgccgg ccgagggcga 1500 ccaccgcctc ggtcagtccg agcaggccca gccacaggcc gagaagtcgg gtcagggcac 1560 gggccgactc ggcgggcagc gcgaggacga cgattccggc gacgtcgacg gccagcgggt 1620 tgcgcaggcc cagcactccg gccggggcgc ccggcaccag cgtggcgagg gccgatgcca 1680 tgagccaggt ccaggaaccc ccaagcctgg cgaggacgtg cgccggatcg ctcaatgctc 1740 cggtgaccgc cccgcccgac ccgtctccct tgtcggcagg ttccgccgca tcacgcggaa 1800 cggagatggc tcccctgtgg atcgggcggc cgctgcgggg ccgcccggtt ggtcggtcgg 1860 tgagcgccgg actccccctt cagctcttcc agggtcgggg tcgacaccga ggtcctggat 1920 cacccgtcag gggtgatccg ggcatgccgt cgtggcggtg aggtgggata cgggaacgat 1980 cggcccacgg gggaccggac gagacgaaga gacgtgagat gagcgatacg aactcgggcg 2040 gcgggcgcca ggccgcttcc ggaccggccc cacgtggccg actccctttc cgccggcgcg 2100 tggccctggt cgctgtcgca cgtcccctga tcgtcacggt cggtctcgtc accgcctact 2160 acctgcttcc cctggacgag agactcagcg ccggcaccct ggtgtcgctg gtgtgcggac 2220 tgctcgcagt ccttctggtg ttctgctggg aggtgcgggc catcacgcgc tccccgcatc 2280 cgcgtctgag agcgatcgag ggcctggccg ccacgctggt gctgttcctg gtcctcttcg 2340 ccggctccta ctacctgctg ggtcgctccg cgcccggctc cttcagcgag ccgctgaaca 2400 ggacggacgc gctgtacttc actctgacca cgttcgccac cgtcggcttc ggggacatca 2460 ccgcacgctc cgagaccggg cggatcctca cgatggcgca gatgacggga gggctactgc 2520 tcgtcggagt cgccgcccgg gtgctggcga gcgcagtgca ggcggggctg caccgacagg 2580 gccggggacc ggcggcatcg ccacgctccg gtgctgcgga ggagccggag gccggaccat 2640 gaccgtaccc ggtggcttca ccgcctccct gccgccggcc gagcgagccg cgtacggcag 2700 gaaggcccgt aaaagggcct cacgttcgtg ccacggctgg tacgagccgg ggcagcggcg 2760 gcctgacccc gtcgacctgc tggagcgcca gtccggcgag cgtgtcccgg cactcgtgcc 2820 catccgctac ggtcgcatgc tggagtcgcc gttccgcttc taccgcggtg cggcagcgat 2880 catggcggcg gacctggcac ccctgcccag cagcggactc caggtgcaat tgtgcgggga 2940 cgcgcacccg ttgaacttcc ggctcctggc ctcaccggag cgccggctgg tcttcgacat 3000 caacgacttc gacgagacgc tgcccggccc cttcgagtgg gacgtcaaac ggctggcggc 3060 cggattcgtg atcgcggccc ggtcgaacgg cttctcgtcc aaggaacaga accgcaccgt 3120 tcgggcctgt gtgcgggcct accgggagcg catgagggag ttcgccgtca tgccgaccct 3180 ggacatctgg tacgcccagg acgacgccga ccacgtacgg caactgctgg ctacggaggc 3240 cagaggagaa gctgagcagc ggctcaggga cgcggctgcg aaggcccgca cacgcaccca 3300 catgagggcg ttcgcgaagc tcacccgcgt cacggccgag ggccggcgca tcacccccga 3360 cccgccgctg atcaccccac tcggcgatct gctcaccgac ccggccgaag ccggccggga 3420 ggaggaactg cggtccgtcg tgaacggcta cgcacggtcc ctgccgcccg agcgccggca 3480 cctgctgcgt cactaccggc ttgtggacat ggcgcgcaag gtggtcggcg tcggcagtgt 3540 cggcacccgc tgctgggtac tgcttctgct cggcagggac gacgacgatc ctctgctgct 3600 ccaggccaag gaagcctcgg aatcggtgct ggcggcccac acgggcggcg aacgctacga 3660 ccatcagggc cgcagggtcg tggccggcca gcgtctgatc cagaccaccg gtgacatctt 3720 tctcggctgg gcgcgcgtca ccggcttcga cggaaaggcc cgggacttct acgtgcgtca 3780 actgtgggac tggaagggcg tcgcgcggcc ggaaaccatg gggcccgacc tgctctccct 3840 cttcgcccgg ctgtgcggtg cctgcctggc gagggcccac gcccgttccg gtgaccccgt 3900 cgcgctcgcc gcgtacctgg gcggcagcga ccgcttcgac ggcgcgctca ccgagttcgc 3960 ccagtcctac gccgatcaga atgaacgcga ccacgaagct ctgctggcgg cctgccgctc 4020 cggcagggtc acggccgccc gtttgtgagg ccgacccggg aacggccggc gggctggcac 4080 acaccgccgc cggtcggcgt cattccggaa gctgccgcat ctccaggacg cgcaggccca 4140 gcgactggca gcgggtgagc aacccgtaca gatgggcctc gtcgatcacc gtgccgaaca 4200 gcacggtctg gccggacatg acgacgtgct ccagctccgg gaacgcgttg gccagcgtcc 4260 gtgacaggtg tccctcgacg cggatctcgt agcgcacgag cggtcctttc accgtaggag 4320 ctcgggacac cgcccggggc tccgggtcgg acggtgctct tggtgacgag cctgcgcctc 4380 gtcgccctcc ggtgccctca cccagcacag gtgactccaa ccgcagtgtc agtgcctttc 4440 agtgcgtcac tgtgatcttg acgacgacga tcaccaggcc gagcagtacg ttgaccgtcg 4500 cggtgacggc caccagtcgt cgcgaggcgc ccgcgcggtg cgccgcggcg acggaccagc 4560 ccacctgacc ggcgacggcg acggacagcg ccagccacag ggtgcccggg acgtccagcc 4620 ccagtacggg gctgacggcg atggccgcgg ccggaggcac ggcggccttg acgatcggcc 4680 actcctcgcg gcacacacgc agaatcaccc gccggtccgg agtgtgccgc gcgagacgcg 4740 ctccgaacag ttcggcgtgg acgtgagcga tccagaacac caagctggtg agcaacagca 4800 gaagaaccag ttcggcgcgg gggaacgagc ccagggtgcc ggcgccgatc acgacggagg 4860 ctgcgagcat 4870 // Genbank file format: ==================== LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999 DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p (AXL2) and Rev7p (REV7) genes, complete cds. ACCESSION U49845 VERSION U49845.1 GI:1293613 KEYWORDS . SOURCE Saccharomyces cerevisiae (baker's yeast) ORGANISM Saccharomyces cerevisiae Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces. REFERENCE 1 (bases 1 to 5028) AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W. TITLE Cloning and sequence of REV7, a gene whose function is required for DNA damage-induced mutagenesis in Saccharomyces cerevisiae JOURNAL Yeast 10 (11), 1503-1509 (1994) PUBMED 7871890 REFERENCE 2 (bases 1 to 5028) AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M. TITLE Selection of axial growth sites in yeast requires Axl2p, a novel plasma membrane glycoprotein JOURNAL Genes Dev. 10 (7), 777-793 (1996) PUBMED 8846915 REFERENCE 3 (bases 1 to 5028) AUTHORS Roemer,T. TITLE Direct Submission JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New Haven, CT, USA FEATURES Location/Qualifiers source 1..5028 /organism="Saccharomyces cerevisiae" /db_xref="taxon:4932" /chromosome="IX" /map="9" CDS <1..206 /codon_start=3 /product="TCP1-beta" /protein_id="AAA98665.1" /db_xref="GI:1293614" /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA AEVLLRVDNIIRARPRTANRQHM" gene 687..3158 /gene="AXL2" CDS 687..3158 /gene="AXL2" /note="plasma membrane glycoprotein" /codon_start=1 /function="required for axial budding pattern of S. cerevisiae" /product="Axl2p" /protein_id="AAA98666.1" /db_xref="GI:1293615" /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL VDFSNKSNVNVGQVKDIHGRIPEML" gene complement(3300..4037) /gene="REV7" CDS complement(3300..4037) /gene="REV7" /codon_start=1 /product="Rev7p" /protein_id="AAA98667.1" /db_xref="GI:1293616" /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK LISGDDKILNGVYSQYEEGESIFGSLF" ORIGIN 1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg 61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct 121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa 181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg 241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa 301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa 361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat 421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga 481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc 541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga 601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta 661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag 721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa 781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata 841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga 901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac 961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg 1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc 1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa 1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca 1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac 1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa 1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag 1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct 1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac 1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa 1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc 1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata 1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca 1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc 1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc 1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca 1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc 1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg 2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt 2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc 2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg 2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca 2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata 2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg 2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga 2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt 2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat 2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt 2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc 2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag 2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta 2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa 2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact 2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt 3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa 3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag 3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct 3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt 3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact 3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa 3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg 3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt 3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc 3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca 3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc 3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc 3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat 3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa 3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga 3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat 3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc 4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc 4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa 4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg 4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc 4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt 4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg 4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg 4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt 4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt 4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat 4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc 4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct 4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta 4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac 4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct 4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct 4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc // Legal values for the division code include: PRI - primate sequences ROD - rodent sequences MAM - other mammalian sequences VRT - other vertebrate sequences INV - invertebrate sequences PLN - plant, fungal, and algal sequences BCT - bacterial sequences VRL - viral sequences PHG - bacteriophage sequences SYN - synthetic sequences UNA - unannotated sequences EST - EST sequences (Expressed Sequence Tags) PAT - patent sequences STS - STS sequences (Sequence Tagged Sites) GSS - GSS sequences (Genome Survey Sequences) HTG - HTGS sequences (High Throughput Genomic sequences) HTC - HTC sequences (High Throughput cDNA sequences) ENV - Environmental sampling sequences CON - Constructed sequences TSA - Transcriptome Shotgun Assembly sequences more info on: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt http://www.insdc.org/documents/feature-table Feature key examples ==================== Key Description CDS Protein-coding sequence RBS ribosome binding site rep_origin Origin of replication protein_bind Protein binding site on DNA tRNA mature transfer RNA Qualifier examples: =================== Key Location/Qualifiers source 1..1509 /organism="Mus musculus" /strain="CD1" /mol_type="genomic DNA" promoter <1..9 /gene="ubc42" mRNA join(10..567,789..1320) /gene="ubc42" CDS join(54..567,789..1254) /gene="ubc42" /product="ubiquitin conjugating enzyme" /function="cell division control" Location examples: ================== The following is a list of common location descriptors with their meanings: Location Description 467 Points to a single base in the presented sequence 340..565 Points to a continuous range of bases bounded by and including the starting and ending bases <345..500 Indicates that the exact lower boundary point of a feature is unknown. The location begins at some base previous to the first base specified (which need not be contained in the presented sequence) and continues to and includes the ending base <1..888 The feature starts before the first sequenced base and continues to and includes base 888 1..>888 The feature starts at the first sequenced base and continues beyond base 888 102.110 Indicates that the exact location is unknown but that it is one of the bases between bases 102 and 110, inclusive 123^124 Points to a site between bases 123 and 124 join(12..78,134..202) Regions 12 to 78 and 134 to 202 should be joined to form one contiguous sequence complement(34..126) Start at the base complementary to 126 and finish at the base complementary to base 34 (the feature is on the strand complementary to the presented strand) complement(join(2691..4571,4918..5163)) Joins regions 2691 to 4571 and 4918 to 5163, then complements the joined segments (the feature is on the strand complementary to the presented strand) join(complement(4918..5163),complement(2691..4571)) Complements regions 4918 to 5163 and 2691 to 4571, then joins the complemented segments (the feature is on the strand complementary to the presented strand) J00194.1:100..202 Points to bases 100 to 202, inclusive, in the entry (in this database) with primary accession number 'J00194' join(1..100,J00194.1:100..202) Joins region 1..100 of the existing entry with the region 100..202 of remote entry J00194 Listing Symbol Meaning ------ ------- a a; adenine c c; cytosine g g; guanine t t; thymine in DNA; uracil in RNA m a or c r a or g w a or t s c or g y c or t k g or t v a or c or g; not t h a or c or t; not g d a or g or t; not c b c or g or t; not a n a or c or g or t Modified base abbreviations: ============================ Authority Sprinzl, M. and Gauss, D.H. Reference Sprinzl, M. and Gauss, D.H. Nucl Acid Res 10, r1 (1982). (note that in Cornish_Bowden, A. Nucl Acid Res 13, 3021-3030 (1985) the IUPAC-IUB declined to recommend a set of abbreviations for modified nucleotides) Contact NCBI Scope /mod_base Abbreviation Modified base description ------------ ------------------------- ac4c 4-acetylcytidine chm5u 5-(carboxyhydroxylmethyl)uridine cm 2'-O-methylcytidine cmnm5s2u 5-carboxymethylaminomethyl-2-thiouridine cmnm5u 5-carboxymethylaminomethyluridine d dihydrouridine fm 2'-O-methylpseudouridine gal q beta,D-galactosylqueosine gm 2'-O-methylguanosine i inosine i6a N6-isopentenyladenosine m1a 1-methyladenosine m1f 1-methylpseudouridine m1g 1-methylguanosine m1i 1-methylinosine m22g 2,2-dimethylguanosine m2a 2-methyladenosine m2g 2-methylguanosine m3c 3-methylcytidine m5c 5-methylcytidine m6a N6-methyladenosine m7g 7-methylguanosine mam5u 5-methylaminomethyluridine mam5s2u 5-methoxyaminomethyl-2-thiouridine man q beta,D-mannosylqueosine mcm5s2u 5-methoxycarbonylmethyl-2-thiouridine mcm5u 5-methoxycarbonylmethyluridine mo5u 5-methoxyuridine ms2i6a 2-methylthio-N6-isopentenyladenosine ms2t6a N-((9-beta-D-ribofuranosyl-2-methyltiopurine-6-yl)car bamoyl)threonine mt6a N-((9-beta-D-ribofuranosylpurine-6-yl)N-methyl-carbam oyl)threonine mv uridine-5-oxyacetic acid-methylester o5u uridine-5-oxyacetic acid (v) osyw wybutoxosine p pseudouridine q queosine s2c 2-thiocytidine s2t 5-methyl-2-thiouridine s2u 2-thiouridine s4u 4-thiouridine m5u 5-methyluridine t6a N-((9-beta-D-ribofuranosylpurine-6-yl)carbamoyl)threo nine tm 2'-O-methyl-5-methyluridine um 2'-O-methyluridine yw wybutosine x 3-(3-amino-3-carboxypropyl)uridine, (acp3)u OTHER (requires /note= qualifier) Amino acid abbreviations: ========================= Authority IUPAC-IUB Joint Commission on Biochemical Nomenclature. Reference IUPAC-IUB Joint Commission on Biochemical Nomenclature. Nomenclature and Symbolism for Amino Acids and Peptides. Eur. J. Biochem. 138:9-37(1984). IUPAC-IUBMB JCBN Newsletter, 1999 http://www.chem.qmul.ac.uk/iubmb/newsletter/1999/item3.html Scope /anticodon, /transl_except Contact EMBL-EBI Listing (note that the abbreviations are legal values for amino acids, not the full names) Abbreviation Amino acid name ------------ --------------- Ala A Alanine Arg R Arginine Asn N Asparagine Asp D Aspartic acid (Aspartate) Cys C Cysteine Gln Q Glutamine Glu E Glutamic acid (Glutamate) Gly G Glycine His H Histidine Ile I Isoleucine Leu L Leucine Lys K Lysine Met M Methionine Phe F Phenylalanine Pro P Proline Pyl O Pyrrolysine Ser S Serine Sec U Selenocysteine Thr T Threonine Trp W Tryptophan Tyr Y Tyrosine Val V Valine Asx B Aspartic acid or Asparagine Glx Z Glutamine or Glutamic acid. Xaa X Any amino acid. Xle J Leucine or Isoleucine TERM termination codon Modified and unusual Amino Acids: ================================= Abbreviation Amino acid ------------ --------- Aad 2-Aminoadipic acid bAad 3-Aminoadipic acid bAla beta-Alanine, beta-Aminoproprionic acid Abu 2-Aminobutyric acid 4Abu 4-Aminobutyric acid, piperidinic acid Acp 6-Aminocaproic acid Ahe 2-Aminoheptanoic acid Aib 2-Aminoisobutyric acid bAib 3-Aminoisobutyric acid Apm 2-Aminopimelic acid Dbu 2,4-Diaminobutyric acid Des Desmosine Dpm 2,2'-Diaminopimelic acid Dpr 2,3-Diaminoproprionic acid EtGly N-Ethylglycine EtAsn N-Ethylasparagine Hyl Hydroxylysine aHyl allo-Hydroxylysine 3Hyp 3-Hydroxyproline 4Hyp 4-Hydroxyproline Ide Isodesmosine aIle allo-Isoleucine MeGly N-Methylglycine, sarcosine MeIle N-Methylisoleucine MeLys 6-N-Methyllysine MeVal N-Methylvaline Nva Norvaline Nle Norleucine Orn Ornithine OTHER (requires /note=) Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore bar and six or more digits, for example: NT_123456 constructed genomic contigs NM_123456 mRNAs NP_123456 proteins NC_123456 chromosomes Feature key: ============ assembly_gap: gap between two components of a genome or transcriptome assembly; attenuator: 1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons; 2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription C_region: constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; includes one or more exons depending on the particular chain CAAT_signal: CAAT box; part of a conserved sequence located about 75bp up-stream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT CDS: coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation. centromere: region of biological interest identified as a centromere and which has been experimentally characterized; D-loop: displacement loop; a region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the original partner DNA strand in this region; also used to describe the displacement of a region of one strand of duplex DNA by a single stranded invader in the reaction catalyzed by RecA protein D_segment: Diversity segment of immunoglobulin heavy chain, and T-cell receptor beta chain; enhancer: a cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter; exon: region of genome that codes for portion of spliced mRNA,rRNA and tRNA; may contain 5'UTR, all CDSs and 3' UTR; gap: gap in the sequence GC_signal: GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation;consensus=GGGCGG; gene: region of biological interest identified as a gene and for which a name has been assigned; iDNA: intervening DNA; DNA which is eliminated through any of several kinds of recombination; intron: a segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences(exons) on either side of it; J_segment: joining segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; LTR: long terminal repeat, a sequence directly repeated at both ends of a defined sequence, of the sort typically found in retroviruses; mat_peptide: mature peptide or protein coding sequence; coding sequence for the mature or final peptide or protein product following post-translational modification; the location does not include the stop codon (unlike the corresponding CDS); misc_binding: site in nucleic acid which covalently or non-covalently binds another moiety that cannot be described by any other binding key (primer_bind or protein_bind); misc_difference: feature sequence is different from that presented in the entry and cannot be described by any other Difference key (unsure, old_sequence, variation, or modified_base); misc_feature: region of biological interest which cannot be described by any other feature key; a new or rare feature; misc_recomb: site of any generalized, site-specific or replicative recombination event where there is a breakage and reunion of duplex DNA that cannot be described by other recombination keys or qualifiers of source key (/proviral); misc_RNA: any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5'UTR, 3'UTR, exon, CDS, sig_peptide, transit_peptide,mat_peptide, intron, polyA_site, ncRNA, rRNA and tRNA); misc_signal: any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal,-35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin). misc_structure: any secondary or tertiary nucleotide structure or conformation that cannot be described by other Structure keys (stem_loop and D-loop); mobile_element: region of genome containing mobile elements; modified_base: the indicated nucleotide is a modified nucleotide and should be substituted for by the indicated molecule (given in the mod_base qualifier value) mRNA: messenger RNA; includes 5'untranslated region (5'UTR), coding sequences (CDS, exon) and 3'untranslated region (3'UTR); ncRNA: a non-protein-coding gene, other than ribosomal RNA and transfer RNA, the functional molecule of which is the RNA transcript; N_region: extra nucleotides inserted between rearranged immunoglobulin segments. old_sequence: the presented sequence revises a previous version of the sequence at this location; operon: region containing polycistronic transcript including a cluster of genes that are under the control of the same regulatory sequences/promotor and in the same biological pathway oriT: origin of transfer; region of a DNA molecule where transfer is initiated during the process of conjugation or mobilization polyA_signal: recognition region necessary for endonuclease cleavage of an RNA transcript that is followed by polyadenylation; consensus=AATAAA; polyA_site: site on an RNA transcript to which will be added adenine residues by post-transcriptional polyadenylation; precursor_RNA: any RNA species that is not yet the mature RNA product; may include 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron) and 3' untranslated region (3'UTR); prim_transcript: primary (initial, unprocessed) transcript; includes 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron) and 3' untranslated region (3'UTR); primer_bind: non-covalent primer binding site for initiation of replication, transcription, or reverse transcription; includes site(s) for synthetic e.g., PCR primer elements; promoter: region on a DNA molecule involved in RNA polymerase binding to initiate transcription; protein_bind: non-covalent protein binding site on nucleic acid; RBS: ribosome binding site; How to check if biopython is installed: [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from __future__ import print_function >>> import Bio >>> print(Bio.__version__) 1.62 Bio.Seq Seq class: ================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> my_seq=Seq("AGTCTAGTCT") >>> my_seq Seq('AGTCTAGTCT', Alphabet()) >>> print(my_seq) AGTCTAGTCT >>> my_seq.alphabet Alphabet() >>> my_seq Seq('AGTCTAGTCT', Alphabet()) >>> my_seq.complement() Seq('TCAGATCAGA', Alphabet()) >>> my_seq.reverse_complement() Seq('AGACTAGACT', Alphabet()) >>> SeqRecord holds a sequence as Seq object with additional annotation, including and identifier, name, and description Bio.SeqIO is for reading and writing sequence file formats and works with SeqRecord object download files from: http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide [uzi@quince-srv2 ~/biopython]$ grep -i ">" ls_orchid.fasta >gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765651|emb|Z78526.1|CGZ78526 C.guttatum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765650|emb|Z78525.1|CAZ78525 C.acaule 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765649|emb|Z78524.1|CFZ78524 C.formosanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765648|emb|Z78523.1|CHZ78523 C.himalaicum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765647|emb|Z78522.1|CMZ78522 C.macranthum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765646|emb|Z78521.1|CCZ78521 C.calceolus 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765645|emb|Z78520.1|CSZ78520 C.segawai 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765644|emb|Z78519.1|CPZ78519 C.pubescens 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765643|emb|Z78518.1|CRZ78518 C.reginae 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765642|emb|Z78517.1|CFZ78517 C.flavum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765641|emb|Z78516.1|CPZ78516 C.passerinum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765640|emb|Z78515.1|MXZ78515 M.xerophyticum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765639|emb|Z78514.1|PSZ78514 P.schlimii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765638|emb|Z78513.1|PBZ78513 P.besseae 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765637|emb|Z78512.1|PWZ78512 P.wallisii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765636|emb|Z78511.1|PEZ78511 P.exstaminodium 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765635|emb|Z78510.1|PCZ78510 P.caricinum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765634|emb|Z78509.1|PPZ78509 P.pearcei 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765633|emb|Z78508.1|PLZ78508 P.longifolium 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765632|emb|Z78507.1|PLZ78507 P.lindenii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765631|emb|Z78506.1|PLZ78506 P.lindleyanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765630|emb|Z78505.1|PSZ78505 P.sargentianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765629|emb|Z78504.1|PKZ78504 P.kaiteurum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765628|emb|Z78503.1|PCZ78503 P.czerwiakowianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765627|emb|Z78502.1|PBZ78502 P.boissierianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765626|emb|Z78501.1|PCZ78501 P.caudatum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765625|emb|Z78500.1|PWZ78500 P.warszewiczianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765624|emb|Z78499.1|PMZ78499 P.micranthum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765623|emb|Z78498.1|PMZ78498 P.malipoense 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765622|emb|Z78497.1|PDZ78497 P.delenatii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765621|emb|Z78496.1|PAZ78496 P.armeniacum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765620|emb|Z78495.1|PEZ78495 P.emersonii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765619|emb|Z78494.1|PNZ78494 P.niveum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765618|emb|Z78493.1|PGZ78493 P.godefroyae 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765617|emb|Z78492.1|PBZ78492 P.bellatulum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765616|emb|Z78491.1|PCZ78491 P.concolor 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765615|emb|Z78490.1|PFZ78490 P.fairrieanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765614|emb|Z78489.1|PDZ78489 P.druryi 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765613|emb|Z78488.1|PTZ78488 P.tigrinum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765612|emb|Z78487.1|PHZ78487 P.hirsutissimum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765611|emb|Z78486.1|PBZ78486 P.barbigerum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765610|emb|Z78485.1|PHZ78485 P.henryanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765609|emb|Z78484.1|PCZ78484 P.charlesworthii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765608|emb|Z78483.1|PVZ78483 P.villosum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765607|emb|Z78482.1|PEZ78482 P.exul 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765606|emb|Z78481.1|PIZ78481 P.insigne 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765605|emb|Z78480.1|PGZ78480 P.gratrixianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765604|emb|Z78479.1|PPZ78479 P.primulinum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765603|emb|Z78478.1|PVZ78478 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765602|emb|Z78477.1|PVZ78477 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765601|emb|Z78476.1|PGZ78476 P.glaucophyllum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765599|emb|Z78474.1|PKZ78474 P.kolopakingii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765598|emb|Z78473.1|PSZ78473 P.sanderianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765597|emb|Z78472.1|PLZ78472 P.lowii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765596|emb|Z78471.1|PDZ78471 P.dianthum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765595|emb|Z78470.1|PPZ78470 P.parishii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765594|emb|Z78469.1|PHZ78469 P.haynaldianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765593|emb|Z78468.1|PAZ78468 P.adductum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765592|emb|Z78467.1|PSZ78467 P.stonei 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765591|emb|Z78466.1|PPZ78466 P.philippinense 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765590|emb|Z78465.1|PRZ78465 P.rothschildianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765589|emb|Z78464.1|PGZ78464 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765588|emb|Z78463.1|PGZ78463 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765587|emb|Z78462.1|PSZ78462 P.sukhakulii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765586|emb|Z78461.1|PWZ78461 P.wardii 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765585|emb|Z78460.1|PCZ78460 P.ciliolare 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765584|emb|Z78459.1|PDZ78459 P.dayanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765583|emb|Z78458.1|PHZ78458 P.hennisianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765582|emb|Z78457.1|PCZ78457 P.callosum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765581|emb|Z78456.1|PTZ78456 P.tonsum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765580|emb|Z78455.1|PJZ78455 P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765579|emb|Z78454.1|PFZ78454 P.fowliei 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765578|emb|Z78453.1|PSZ78453 P.schoseri 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765577|emb|Z78452.1|PBZ78452 P.bougainvilleanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765576|emb|Z78451.1|PHZ78451 P.hookerae 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765575|emb|Z78450.1|PPZ78450 P.papuanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765574|emb|Z78449.1|PMZ78449 P.mastersianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765573|emb|Z78448.1|PAZ78448 P.argus 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765572|emb|Z78447.1|PVZ78447 P.venustum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765571|emb|Z78446.1|PAZ78446 P.acmodontum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765570|emb|Z78445.1|PUZ78445 P.urbanianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765569|emb|Z78444.1|PAZ78444 P.appletonianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765568|emb|Z78443.1|PLZ78443 P.lawrenceanum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765567|emb|Z78442.1|PBZ78442 P.bullenianum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765566|emb|Z78441.1|PSZ78441 P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765565|emb|Z78440.1|PPZ78440 P.purpuratum 5.8S rRNA gene and ITS1 and ITS2 DNA >gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA [uzi@quince-srv2 ~/biopython]$ grep -c ">" ls_orchid.fasta 94 Read the FASTA file with Biopython: =================================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> for seq_record in SeqIO.parse("ls_orchid.fasta","fasta"): ... print(seq_record.id) ... print(repr(seq_record.seq)) ... print(len(seq_record)) ... gi|2765658|emb|Z78533.1|CIZ78533 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet()) 740 gi|2765657|emb|Z78532.1|CCZ78532 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet()) 753 gi|2765656|emb|Z78531.1|CFZ78531 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet()) 748 Here, biopython defaulted the alphabet to generic SingleLetterAlphabet() repr function: ============== >>> x = 'foo' >>> x 'foo' So the name x is attached to 'foo' string. When you call for example repr(x) the iterpreter put 'foo' instead of x and then calls repr('foo'). >>> repr(x) "'foo'" >>> x.__repr__() "'foo'" repr actually calls a magic method __repr__ of x, which gives a string containing the representation of the value 'foo' assigned to x. So it returns 'foo' inside the string "" resulting in "'foo'". The idea of repr is to give a string which contains a series of symbols which we can type in the interpreter and get the same value which was sent as an argument to repr. enumerate function: =================== [(i, j) for i, j in enumerate(mylist)] You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first: [pair for pair in enumerate(mylist)] Either way, the result that gets returned is as expected: > [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')] Read genbank file with biopython: ================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> for seq_record in SeqIO.parse("ls_orchid.gbk","genbank"): ... print(seq_record.id) ... print(repr(seq_record.seq)) ... print(len(seq_record)) ... Z78533.1 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA()) 740 Z78532.1 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', IUPACAmbiguousDNA()) 753 Z78531.1 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', IUPACAmbiguousDNA()) 748 This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. IUPAC in detail: ================ Bio.Alphabet.IUPAC --> provides definition for proteins, DNA, and RNA and also provides the ability to customize and extend basic functions IUPACProtein ExtendedIUPACProtein (additional elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle” for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid)) IUPACUnambiguousDNA IUPACAmbiguousDNA ExtendedIUPACDNA IUPACAmbiguousRNA UPACUnambiguousRNA [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq=Seq("AGTAGATGTA", IUPAC.unambiguous_dna) >>> my_seq Seq('AGTAGATGTA', IUPACUnambiguousDNA()) >>> my_seq.alphabet IUPACUnambiguousDNA() >>> for protein, in the above one use IUPAC.protein Accessing Seq members: ====================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq=Seq("GATCG", IUPAC.unambiguous_dna) >>> for index, letter in enumerate(my_seq): ... print("%i %s" % (index, letter)) ... 0 G 1 A 2 T 3 C 4 G >>> print(len(my_seq)) 5 >>> print(my_seq[0]) G >>> print(my_seq[2]) T >>> print(my_seq[-1]) G >>> Seq is similar to python strings: ================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> "AAAA".count("AA") 2 >>> Seq("AAAA").count("AA") 2 >>> Calculate GC Contents: ====================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq=Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',IUPAC.unambiguous_dna) >>> len(my_seq) 32 >>> my_seq.count("G") 9 >>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq) 46.875 >>> [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> from Bio.SeqUtils import GC >>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna) >>> GC(my_seq) 46.875 >>> Do slices with a start, stop and stride (the step size, which defaults to one): =============================================================================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq[4:12] Seq('GATGGGCC', IUPACUnambiguousDNA()) >>> my_seq[0::3] Seq('GCTGTAGTAAG', IUPACUnambiguousDNA()) >>> my_seq[1::3] Seq('AGGCATGCATC', IUPACUnambiguousDNA()) >>> my_seq[2::3] Seq('TAGCTAAGAC', IUPACUnambiguousDNA()) >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA()) >>> str(my_seq) 'GATCGATGGGCCTATATAGGATCGAAAATCGC' >>> print(my_seq) GATCGATGGGCCTATATAGGATCGAAAATCGC >>> fasta_format_string=">Name\n%s\n" % my_seq >>> print(fasta_format_string) >Name GATCGATGGGCCTATATAGGATCGAAAATCGC >>> my_seq.tostring() 'GATCGATGGGCCTATATAGGATCGAAAATCGC' >>> Concatenating or adding sequences: ================================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Alphabet import IUPAC >>> from Bio.Seq import Seq >>> protein_seq = Seq("EVRNAK", IUPAC.protein) >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> protein_seq + dna_seq Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py", line 248, in __add__ % (repr(self.alphabet), repr(other.alphabet))) TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA() >>> from Bio.Alphabet import generic_alphabet >>> protein_seq.alphabet = generic_alphabet >>> dna_seq.alphabet = generic_alphabet >>> protein_seq + dna_seq Seq('EVRNAKACGT', Alphabet()) >>> [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_nucleotide >>> from Bio.Alphabet import IUPAC >>> nuc_seq = Seq("GATCGATGC", generic_nucleotide) >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> nuc_seq Seq('GATCGATGC', NucleotideAlphabet()) >>> dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> nuc_seq + dna_seq Seq('GATCGATGCACGT', NucleotideAlphabet()) >>> Changing case: ============== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> dna_seq = Seq("acgtACGT", generic_dna) >>> dna_seq Seq('acgtACGT', DNAAlphabet()) >>> dna_seq.upper() Seq('ACGTACGT', DNAAlphabet()) >>> dna_seq.lower() Seq('acgtacgt', DNAAlphabet()) >>> "GTAC" in dna_seq False >>> "GTAC" in dna_seq.upper() True >>> [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna) >>> dna_seq Seq('ACGT', IUPACUnambiguousDNA()) >>> dna_seq.lower() Seq('acgt', DNAAlphabet()) >>> IUPAC alphabets are for upper case sequences only [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) >>> my_seq Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) >>> my_seq.complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA()) >>> my_seq.reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA()) >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA()) >>> [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> protein_seq = Seq("EVRNAK", IUPAC.protein) >>> protein_seq.complement() Traceback (most recent call last): File "", line 1, in File "/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py", line 718, in complement raise ValueError("Proteins do not have complements!") ValueError: Proteins do not have complements! >>> TRANSCRIPTION & TRANSLATION: ============================ DNA coding strand (aka Crick strand, strand +1) 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ ||||||||||||||||||||||||||||||||||||||| 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ DNA template strand (aka Watson strand, strand −1) | Transcription ↓ 5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ Single stranded messenger RNA The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA. [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) >>> coding_dna Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) >>> template_dna = coding_dna.reverse_complement() >>> template_dna Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA()) >>> messenger_rna = coding_dna.transcribe() >>> messenger_rna Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA()) >>> messenger_rna.back_transcribe() Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA()) >>> messenger_rna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate() Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(table="Vertebrate Mitochondrial") Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(to_stop=True) Seq('MAIVMGR', IUPACProtein()) >>> coding_dna.translate(table=2) Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*')) >>> coding_dna.translate(table=2, to_stop=True) Seq('MAIVMGRWKGAR', IUPACProtein()) >>> coding_dna.translate(table=2, stop_symbol="@") Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@')) >>> from Bio.Alphabet import generic_dna >>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \ ... "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \ ... "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \ ... "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \ ... "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA", ... generic_dna) >>> gene.translate(table="Bacterial") Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*', HasStopCodon(ExtendedIUPACProtein(), '*') >>> gene.translate(table="Bacterial", to_stop=True) Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein()) >>> gene.translate(table="Bacterial", cds=True) Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR', ExtendedIUPACProtein()) As you can see, all this does is switch T → U, and adjust the alphabet. In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1 ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt http://www.kazusa.or.jp/codon/ [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Data import CodonTable >>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"] >>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"] >>> standard_table = CodonTable.unambiguous_dna_by_id[1] >>> mito_table = CodonTable.unambiguous_dna_by_id[2] >>> print(standard_table) Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-- >>> print(mito_table) Table 2 Vertebrate Mitochondrial, SGC1 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA W | A T | TTG L | TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L | CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I(s)| ACT T | AAT N | AGT S | T A | ATC I(s)| ACC T | AAC N | AGC S | C A | ATA M(s)| ACA T | AAA K | AGA Stop| A A | ATG M(s)| ACG T | AAG K | AGG Stop| G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V(s)| GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-- >>> mito_table.stop_codons ['TAA', 'TAG', 'AGA', 'AGG'] >>> mito_table.start_codons ['ATT', 'ATC', 'ATA', 'ATG', 'GTG'] >>> mito_table.forward_table["ACG"] 'T' >>> Comparing Seq objects: ====================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna) >>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna) >>> seq1 == seq2 /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py:174: FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception). In the interim please use id(seq1)==id(seq2) or str(seq1)==str(seq2) to make your code explicit and to avoid this warning. "and to avoid this warning.", FutureWarning) False >>> id(seq1) == id(seq2) False >>> id(seq1) == id(seq1) True >>> str(seq1) == str(seq2) True >>> str(seq1) == str(seq2) True >>> MutableSeq objects: =================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) >>> my_seq[5] = "G" Traceback (most recent call last): File "", line 1, in TypeError: 'Seq' object does not support item assignment >>> mutable_seq = my_seq.tomutable() >>> mutable_seq MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> from Bio.Seq import MutableSeq >>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) >>> mutable_seq[5] = "C" >>> mutable_seq MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> mutable_seq.remove("T") >>> mutable_seq MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA()) >>> mutable_seq.reverse() >>> mutable_seq MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA()) >>> str(mutable_seq) 'AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG' >>> new_seq = mutable_seq.toseq() >>> new_seq Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA()) >>> Difference between mutable and immutable objects in Python means that you can’t use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way. UnknownSeq object: ================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import UnknownSeq >>> unk = UnknownSeq(20) >>> unk UnknownSeq(20, alphabet = Alphabet(), character = '?') >>> print(unk) ???????????????????? >>> len(unk) 20 >>> from Bio.Alphabet import IUPAC >>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna) >>> unk_dna UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> print(unk_dna) NNNNNNNNNNNNNNNNNNNN >>> unk_dna.complement() UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> unk_dna.reverse_complement() UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N') >>> unk_dna.transcribe() UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N') >>> unk_protein=unk_dna.translate() >>> unk_protein UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X') >>> print(unk_protein) XXXXXX >>> len(unk_protein) 6 >>> Use Biopython functions on strings: =================================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate >>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG" >>> reverse_complement(my_string) 'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC' >>> transcribe(my_string) 'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG' >>> back_transcribe(my_string) 'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG' >>> translate(my_string) 'AVMGRWKGGRAAG*' >>> SeqRecord object: ================= [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> simple_seq = Seq("GATC") >>> from Bio.SeqRecord import SeqRecord >>> simple_seq_r = SeqRecord(simple_seq) >>> simple_seq_r.id '' >>> simple_seq_r.id="AC12345" >>> simple_seq_r.description="made up sequence" >>> print(simple_seq_r.description) made up sequence >>> simple_seq_r.seq Seq('GATC', Alphabet()) >>> simple_seq_r=SeqRecord(simple_seq,id="AC12345") >>> simple_seq_r.annotations["evidence"] = "made up" >>> print(simple_seq_r.annotations) {'evidence': 'made up'} >>> print(simple_seq_r.annotations["evidence"]) made up >>> simple_seq_r.letter_annotations["phred_quality"]=[40,40,38,30] >>> print(simple_seq_r.letter_annotations) {'phred_quality': [40, 40, 38, 30]} >>> print(simple_seq_r.letter_annotations["phred_quality"]) [40, 40, 38, 30] The SeqRecord class itself is quite simple, and offers the following information as attributes: .seq – The sequence itself, typically a Seq object. .id – The primary ID used to identify the sequence – a string. In most cases this is something like an accession number. .name – A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record. .description – A human readable description or expressive name for the sequence – a string. .letter_annotations – Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores or secondary structure information (e.g. from Stockholm/PFAM alignment files). .annotations – A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence. .features – A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). .dbxrefs - A list of database cross-references as strings. SeqRecord example for a FASTA file: =================================== Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record=SeqIO.read("NC_005816.fna","fasta") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[]) >>> record.seq Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()) >>> record.id 'gi|45478711|ref|NC_005816.1|' >>> record.name 'gi|45478711|ref|NC_005816.1|' >>> record.description 'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence' >>> record.dbxrefs [] >>> record.annotations {} >>> record.letter_annotations {} >>> record.features [] >>> SeqRecord example for a GB file: ================================ Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.gb", "genbank") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037']) >>> record.seq Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()) >>> record.id 'NC_005816.1' >>> record.name 'NC_005816' >>> record.description 'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.' >>> record.letter_annotations {} >>> len(record.annotations) 11 >>> record.annotations {'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.', 'sequence_version': 1, 'source': 'Yersinia pestis biovar Microtus str. 91001', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'keywords': [''], 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'accessions': ['NC_005816'], 'data_file_division': 'BCT', 'date': '21-JUL-2008', 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'gi': '45478711'} >>> record.dbxrefs ['Project:58037'] >>> len(record.features) 41 >>> record.features [SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(9609), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1954), strand=1), type='repeat_region'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(959), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(BeforePosition(110), ExactPosition(209), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(812), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(1108), ExactPosition(1885), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1366), AfterPosition(1669), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1432), ExactPosition(1456), strand=1), type='misc_feature'), SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(1435), ExactPosition(1459), strand=1), FeatureLocation(ExactPosition(1618), ExactPosition(1621), strand=1)], 'order'), type='misc_feature', location_operator='order'), SeqFeature(FeatureLocation(ExactPosition(1606), ExactPosition(1624), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3119), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3119), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3107), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(3485), ExactPosition(3857), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(3485), ExactPosition(3857), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(3497), ExactPosition(3626), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(4342), ExactPosition(4780), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(4342), ExactPosition(4780), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(4814), ExactPosition(5888), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(4814), ExactPosition(5888), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(5909), ExactPosition(5911), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5933), ExactPosition(5933), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5933), ExactPosition(5933), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5947), ExactPosition(5948), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(6004), ExactPosition(6421), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(6004), ExactPosition(6421), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(6524), ExactPosition(6525), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7602), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7602), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7599), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(7788), ExactPosition(8088), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(7788), ExactPosition(8088), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(7836), ExactPosition(7995), strand=-1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(8087), ExactPosition(8360), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(8087), ExactPosition(8360), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(8090), AfterPosition(8357), strand=-1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(8529), ExactPosition(8529), strand=1), type='variation')] >>> SeqFeature object: ================== The attributes of a SeqFeature are: .type – This is a textual description of the type of feature (for instance, this will be something like ‘CDS’ or ‘gene’). .location – The location of the SeqFeature on the sequence that you are dealing with. The SeqFeature delegates much of its functionality to the location object, and includes a number of shortcut attributes for properties of the location: .ref – shorthand for .location.ref – any (different) reference sequence the location is referring to. Usually just None. .ref_db – shorthand for .location.ref_db – specifies the database any identifier in .ref refers to. Usually just None. .strand – shorthand for .location.strand – the strand on the sequence that the feature is located on. For double stranded nucleotide sequence this may either be 1 for the top strand, −1 for the bottom strand, 0 if the strand is important but is unknown, or None if it doesn’t matter. This is None for proteins, or single stranded sequences. .qualifiers – This is a Python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be “evidence” and the value might be “computational (non-experimental).” This is just a way to let the person who is looking at the feature know that it has not be experimentally (i. e. in a wet lab) confirmed. Note that other the value will be a list of strings (even when there is only one string). This is a reflection of the feature tables in GenBank/EMBL files. .sub_features – This used to be used to represent features with complicated locations like ‘joins’ in GenBank/EMBL files. This has been deprecated with the introduction of the CompoundLocation object, and should now be ignored. How to make positions and locations: ==================================== [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqFeature >>> start_pos=SeqFeature.AfterPosition(5) >>> end_pos=SeqFeature.BetweenPosition(9,left=8,right=9) >>> my_location=SeqFeature.FeatureLocation(start_pos,end_pos) >>> print(my_location) [>5:(8^9)] >>> my_location.start AfterPosition(5) >>> print(my_location.start) >5 >>> my_location.end BetweenPosition(9, left=8, right=9) >>> print(my_location.end) (8^9) >>> int(my_location.start) 5 >>> int(my_location.end) 9 >>> my_location.nofuzzy_start 5 >>> my_location.nofuzzy_end 9 >>> exact_location=SeqFeature.FeatureLocation(5,9) >>> print(exact_location) [5:9] >>> exact_location.start ExactPosition(5) >>> int(exact_location.start) 5 >>> exact_location.nofuzzy_start 5 >>> position – This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions. – ExactPosition, BeforePosition (something like <13), AfterPosition(something like >13), WithinPosition (something like 1.5. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.), OneofPosition, UnknownPosition(This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt) location – A location is region of sequence bounded by some positions. For instance 5..20 (i. e. 5 to 20) is a location. – FeatureLocation/CompoundLocation (for handling ‘join’ locations in EMBL/GenBank files) Locating SNP in a feature: ========================== [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> my_snp=4350 >>> record = SeqIO.read("NC_005816.gb", "genbank") >>> for feature in record.features: ... if my_snp in feature: ... print("%s %s" % (feature.type, feature.qualifiers.get('db_xref'))) ... source ['taxon:229193'] gene ['GeneID:2767712'] CDS ['GI:45478716', 'GeneID:2767712'] >>> You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not Extracting features: ==================== [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.SeqFeature import SeqFeature, FeatureLocation >>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC") >>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1) >>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement() >>> print(feature_seq) AGCCTTTGCCGTC >>> feature_seq = example_feature.extract(example_parent) >>> print(feature_seq) AGCCTTTGCCGTC >>> print(len(example_feature.extract(example_parent))) 13 >>> print(len(example_feature)) 13 >>> print(len(example_feature.location)) 13 >>> For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions. We also have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object. Any reference objects are stored as a list in the SeqRecord object’s annotations dictionary under the key “references” Printing as FASTA: ================== [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> from Bio.Alphabet import generic_protein >>> record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \ ... +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \ ... +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \ ... +"SSAC", generic_protein),id="gi|14150838|gb|AAK54648.1|AF376133_1",description="chalcone synthase [Cucumis sativus]") >>> print(record.format("fasta")) >gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus] MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM SSAC >>> Slicing a SeqRecord: ==================== We’re going to focus in on the pim gene, YP_pPCP05. If you have a look at the GenBank file directly you’ll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780. From looking at the file you can work out that these are the twelfth and thirteenth entries in the file, so in Python zero-based counting they are entries 11 and 12 in the features list [uzi@quince-srv2 ~]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record=SeqIO.read("NC_005816.gb", "genbank") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037']) >>> len(record) 9609 >>> len(record.features) 41 >>> print(record.features[20]) type: gene location: [4342:4780](+) qualifiers: Key: db_xref, Value: ['GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] >>> print(record.features[21]) type: CDS location: [4342:4780](+) qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)'] Key: product, Value: ['pesticin immunity protein'] Key: protein_id, Value: ['NP_995571.1'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH'] >>> sub_record=record[4300:4800] >>> sub_record SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[]) >>> len(sub_record) 500 >>> sub_record.features [SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='CDS')] >>> sub_record.features[0] SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='gene') >>> print(sub_record.features[0]) type: gene location: [42:480](+) qualifiers: Key: db_xref, Value: ['GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] >>> print(sub_record.features[1]) type: CDS location: [42:480](+) qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)'] Key: product, Value: ['pesticin immunity protein'] Key: protein_id, Value: ['NP_995571.1'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH'] >>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial." >>> print(sub_record.format("genbank")) LOCUS NC_005816 500 bp DNA UNK 01-JAN-1980 DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial. ACCESSION NC_005816 VERSION NC_005816.1 KEYWORDS . SOURCE . ORGANISM . . FEATURES Location/Qualifiers gene 43..480 /db_xref="GeneID:2767712" /locus_tag="YP_pPCP05" /gene="pim" CDS 43..480 /locus_tag="YP_pPCP05" /codon_start=1 /product="pesticin immunity protein" /transl_table=11 /note="similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)" /db_xref="GI:45478716" /db_xref="GeneID:2767712" /translation="MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNS LSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFS LIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH" /gene="pim" /protein_id="NP_995571.1" ORIGIN 1 ataaatagat tattccaaat aatttattta tgtaagaaca ggatgggagg gggaatgatc 61 tcaaagttat tttgcttggc tctcatattt ttatcatcaa gtggccttgc agaaaaaaac 121 acatatacag caaaagacat cttgcaaaac ctagaattaa atacctttgg caattcattg 181 tctcatggca tctatgggaa acagacaacc ttcaagcaaa ccgagtttac aaatattaaa 241 agcaacacca aaaaacacat tgcacttatc aataaagaca actcatggat gatatcatta 301 aaaatactag gaattaagag agatgagtat actgtctgtt ttgaagattt ctctctaata 361 agaccgccaa catatgtagc catacatcct ctacttataa aaaaagtaaa atctggaaac 421 tttatagtag tgaaagaaat aaagaaatct atccctggtt gcactgtata ttatcattaa 481 tagcaagccc ctcattatta // >>> Adding SeqRecord objects: ========================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record = next(SeqIO.parse("example.fastq", "fastq")) >>> len(record) 25 >>> print(record.seq) CCCTTCTTGTCTTCAGCGTTTCTCC >>> print(record.letter_annotations["phred_quality"]) [26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 26, 23, 23] >>> edited=record[:20]+record[21:] >>> print(edited.seq) CCCTTCTTGTCTTCAGCGTTCTCC >>> print(edited.letter_annotations["phred_quality"]) [26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 23, 23] >>> Shifting circular genome: ========================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.gb","genbank") >>> record SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037']) >>> len(record) 9609 >>> len(record.features) 41 >>> record.annotations.keys() ['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references', 'accessions', 'data_file_division', 'date', 'organism', 'gi'] >>> shifted = record[2000:]+record[:2000] >>> shifted SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[]) >>> len(shifted) 9609 >>> len(shifted.features) 40 >>> shifted.dbxrefs [] >>> shifted.annotations.keys() [] >>> shifted.dbxrefs=record.dbxrefs[:] >>> shifted.annotations=record.annotations.copy() >>> shifted.dbxrefs ['Project:58037'] >>> shifted.annotations {'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.', 'date': '21-JUL-2008', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'gi': '45478711', 'source': 'Yersinia pestis biovar Microtus str. 91001', 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'accessions': ['NC_005816'], 'data_file_division': 'BCT', 'keywords': [''], 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'sequence_version': 1} >>> Reverse-complementing SeqRecord object: ======================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record = SeqIO.read("NC_005816.gb", "genbank") >>> print("%s %i %i %i %i" % (record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations))) NC_005816.1 9609 41 1 11 >>> rc = record.reverse_complement(id="TESTING") >>> print("%s %i %i %i %i" % (rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations))) TESTING 9609 41 0 0 >>> List comprehension on SeqRecord: ================================ [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")] >>> identifiers ['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', 'Z78526.1', 'Z78525.1', 'Z78524.1', 'Z78523.1', 'Z78522.1', 'Z78521.1', 'Z78520.1', 'Z78519.1', 'Z78518.1', 'Z78517.1', 'Z78516.1', 'Z78515.1', 'Z78514.1', 'Z78513.1', 'Z78512.1', 'Z78511.1', 'Z78510.1', 'Z78509.1', 'Z78508.1', 'Z78507.1', 'Z78506.1', 'Z78505.1', 'Z78504.1', 'Z78503.1', 'Z78502.1', 'Z78501.1', 'Z78500.1', 'Z78499.1', 'Z78498.1', 'Z78497.1', 'Z78496.1', 'Z78495.1', 'Z78494.1', 'Z78493.1', 'Z78492.1', 'Z78491.1', 'Z78490.1', 'Z78489.1', 'Z78488.1', 'Z78487.1', 'Z78486.1', 'Z78485.1', 'Z78484.1', 'Z78483.1', 'Z78482.1', 'Z78481.1', 'Z78480.1', 'Z78479.1', 'Z78478.1', 'Z78477.1', 'Z78476.1', 'Z78475.1', 'Z78474.1', 'Z78473.1', 'Z78472.1', 'Z78471.1', 'Z78470.1', 'Z78469.1', 'Z78468.1', 'Z78467.1', 'Z78466.1', 'Z78465.1', 'Z78464.1', 'Z78463.1', 'Z78462.1', 'Z78461.1', 'Z78460.1', 'Z78459.1', 'Z78458.1', 'Z78457.1', 'Z78456.1', 'Z78455.1', 'Z78454.1', 'Z78453.1', 'Z78452.1', 'Z78451.1', 'Z78450.1', 'Z78449.1', 'Z78448.1', 'Z78447.1', 'Z78446.1', 'Z78445.1', 'Z78444.1', 'Z78443.1', 'Z78442.1', 'Z78441.1', 'Z78440.1', 'Z78439.1'] >>> all_species = [seq_record.annotations["organism"] for seq_record in \ ... SeqIO.parse("ls_orchid.gbk", "genbank")] >>> print(all_species[1:5]) ['Cypripedium californicum', 'Cypripedium fasciculatum', 'Cypripedium margaritaceum', 'Cypripedium lichiangense'] >>> import gzip >>> handle = gzip.open("ls_orchid.gbk.gz", "r") >>> print(sum(len(r) for r in SeqIO.parse(handle, "gb"))) 67518 >>> handle.close() >>> import bz2 >>> handle = bz2.BZ2File("ls_orchid.gbk.bz2", "r") >>> print(sum(len(r) for r in SeqIO.parse(handle, "gb"))) 67518 >>> handle.close() >>> records = (rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \ ... for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec)<700) >>> SeqIO.write(records, "rev_comp.fasta", "fasta") 18 Manually iterating through SeqIO.parse: ======================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta") >>> first_record = next(record_iterator) >>> print(first_record.id) gi|2765658|emb|Z78533.1|CIZ78533 >>> second_record = next(record_iterator) >>> print(second_record.id) gi|2765657|emb|Z78532.1|CCZ78532 >>> Getting a list of records from SeqIO: ===================================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> records = list(SeqIO.parse("ls_orchid.gbk", "genbank")) >>> print("Found %i records" % len(records)) Found 94 records >>> print(records[-1].id) Z78439.1 >>> print(records[0].id) Z78533.1 >>> Searching DBs online: ===================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import Entrez >>> from Bio import SeqIO >>> Entrez.email = "A.N.Other@example.com" >>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291") >>> seq_record = SeqIO.read(handle, "fasta") >>> handle.close() >>> print("%s with %i features" % (seq_record.id, len(seq_record.features))) gi|6273291|gb|AF191665.1|AF191665 with 0 features >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291") >>> seq_record = SeqIO.read(handle, "gb") #using "gb" as an alias for "genbank" >>> handle.close() >>> print("%s with %i features" % (seq_record.id, len(seq_record.features))) AF191665.1 with 3 features >>> handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", \ ... id="6273291,6273290,6273289") >>> for seq_record in SeqIO.parse(handle, "gb"): ... print seq_record.id, seq_record.description[:50] + "..." ... print "Sequence length %i," % len(seq_record), ... print "%i features," % len(seq_record.features), ... print "from: %s" % seq_record.annotations["source"] ... AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c... Sequence length 902, 3 features, from: chloroplast Corynopuntia marenae AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c... Sequence length 899, 3 features, from: chloroplast Grusonia clavata AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for... Sequence length 899, 3 features, from: chloroplast Grusonia bradtiana >>> handle.close() >>> from Bio import ExPASy >>> handle = ExPASy.get_sprot_raw("O23729") >>> seq_record = SeqIO.read(handle, "swiss") >>> handle.close() >>> print(seq_record.id) O23729 >>> print(seq_record.name) CHS3_BROFI >>> print(seq_record.description) RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3; >>> print(repr(seq_record.seq)) Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet()) >>> print("Length %i" % len(seq_record)) Length 394 >>> print(seq_record.annotations["keywords"]) ['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase'] >>> Sequence files as dictionaries: =============================== Bio.SeqIO.to_dict() --> This is basically a helper function to build a normal Python dictionary with each entry held as a SeqRecord object in memory, allowing you to modify the records. Bio.SeqIO.index() --> Useful middle ground, acting like a read only dictionary and parsing sequences into SeqRecord objects on demand Bio.SeqIO.index_db() --> Acts like a read only dictionary but stores the identifiers and file offsets in a file on disk (as an SQLite3 database), meaning it has very low memory requirements [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank")) >>> len(orchid_dict) 94 >>> orchid_dict.keys()[1:10] ['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', 'Z78441.1', 'Z78444.1', 'Z78529.1', 'Z78505.1'] >>> orchid_dict.values()[1:3] [SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC', IUPACAmbiguousDNA()), id='Z78464.1', name='Z78464', description='P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[]), SeqRecord(seq=Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA', IUPACAmbiguousDNA()), id='Z78455.1', name='Z78455', description='P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[])] >>> seq_record = orchid_dict["Z78475.1"] >>> print(seq_record) ID: Z78475.1 Name: Z78475 Description: P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA. Number of features: 5 /sequence_version=1 /source=Paphiopedilum supardii /taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum'] /keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'] /references=[Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)] /accessions=['Z78475'] /data_file_division=PLN /date=30-NOV-2006 /organism=Paphiopedilum supardii /gi=2765600 Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA()) >>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta")) >>> print(orchid_dict.keys()[1:5]) ['gi|2765646|emb|Z78521.1|CCZ78521', 'gi|2765576|emb|Z78451.1|PHZ78451', 'gi|2765577|emb|Z78452.1|PBZ78452', 'gi|2765586|emb|Z78461.1|PWZ78461'] >>> def get_accession(record): ... """"Given a SeqRecord, return the accession number as a string. ... ... e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1" ... """ ... parts = record.id.split("|") ... assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb" ... return parts[3] ... >>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"), key_function=get_accession) >>> print(orchid_dict.keys()[1:5]) ['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1'] >>> from Bio.SeqUtils.CheckSum import seguid >>> seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"), ... lambda rec : seguid(rec.seq)) >>> seguid_dict["4UjNimWz3PcmI0F0BOC5IksDrOg"] SeqRecord(seq=Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG', IUPACAmbiguousDNA()), id='Z78441.1', name='Z78441', description='P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[]) >>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank") >>> orchid_dict.keys()[1:5] ['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1'] >>> orchid_dict["Z78475.1"].seq Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA()) >>> Getting the raw data for a record: ================================== ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss") >>> handle = open("selected.dat", "w") >>> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]: ... handle.write(uniprot.get_raw(acc)) ... >>> handle.close() >>> selected=SeqIO.index("selected.dat","swiss") >>> selected SeqIO.index('selected.dat', 'swiss', alphabet=None, key_function=None) >>> selected.keys() ['P19801', 'Q9TRC7', 'P33487', 'Q8JZQ5', 'P13689'] Indexing in sqlite3 database: ============================= ftp://ftp.ncbi.nih.gov/genbank/ get all the viral sequences gbvrl1.seq,gbvl2.seq, .... and store that in a subfolder viral [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> files = ["viral/gbvrl%i.seq" % (i+1) for i in range(29)] >>> gb_vrl = SeqIO.index_db("gbvrl.idx", files, "genbank") >>> print("%i sequences indexed" % len(gb_vrl)) 1507726 sequences indexed >>> print(gb_vrl["GQ333173.1"].description) HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial cds. >>> print(gb_vrl.get_raw("GQ333173.1")) LOCUS GQ333173 459 bp DNA linear VRL 21-OCT-2009 DEFINITION HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial cds. ACCESSION GQ333173 VERSION GQ333173.1 GI:254928714 KEYWORDS . SOURCE Human immunodeficiency virus 1 (HIV-1) ORGANISM Human immunodeficiency virus 1 Viruses; Retro-transcribing viruses; Retroviridae; Orthoretrovirinae; Lentivirus; Primate lentivirus group. REFERENCE 1 (bases 1 to 459) AUTHORS Gray,R.R., Tatem,A.J., Lamers,S., Hou,W., Laeyendecker,O., Serwadda,D., Sewankambo,N., Gray,R.H., Wawer,M., Quinn,T.C., Goodenow,M.M. and Salemi,M. TITLE Spatial phylodynamics of HIV-1 epidemic emergence in east Africa JOURNAL AIDS 23 (14), F9-F17 (2009) PUBMED 19644346 REFERENCE 2 (bases 1 to 459) AUTHORS Collinson-Streng,A.N., Redd,A.D., Sewankambo,N.K., Serwadda,D., Rezapour,M., Lamers,S.L., Gray,R.H., Wawer,M.J., Quinn,T.C. and Laeyendecker,O. TITLE Geographic HIV Type 1 Subtype Distribution in Rakai District, Uganda JOURNAL AIDS Res. Hum. Retroviruses 25 (10), 1045-1048 (2009) PUBMED 19803713 REFERENCE 3 (bases 1 to 459) AUTHORS Collinson-Streng,A., Redd,A.D., Sewankambo,N.K., Serwadda,D., Rezapour,M., Lamers,S.L., Gray,R.H., Wawer,M.J., Quinn,T.C. and Laeyendecker,O. TITLE Direct Submission JOURNAL Submitted (26-JUN-2009) National Institutes of Health, National Institute of Allergy and Infectious Diseases, Baltimore, MD 21205, USA FEATURES Location/Qualifiers source 1..459 /organism="Human immunodeficiency virus 1" /proviral /mol_type="genomic DNA" /isolate="F12279A1" /host="Homo sapiens" /db_xref="taxon:11676" /country="Uganda:Kasasa-Sanje" /collection_date="19-Jun-1995" /note="subtype: D" gene <1..>459 /gene="gag" CDS <1..>459 /gene="gag" /note="p24" /codon_start=1 /product="gag protein" /protein_id="ACT86536.1" /db_xref="GI:254928715" /translation="LNAWVKVIEEKAFSPEVIPMFSALSEGATPQDLNTMLNTVGGHQ AAMQMLKETINEEAAEWDRLHPVHAGPIAPGQMREPRGSDIAGTTSTLQEQIGWMTNN PPIPVGEIYKRWIILGLNKIVRMYSPVSILDIRQGPKEPFRDYVDRFYKTL" ORIGIN 1 ttgaatgcat gggtaaaagt aatagaggag aaggctttca gcccagaagt aatacccatg 61 ttttcagcat tatcagaagg agccacccca caagatttaa acaccatgct aaacacagtg 121 gggggacatc aagcagctat gcaaatgtta aaagagacaa tcaatgagga agctgcagaa 181 tgggataggc tacatccagt acatgcaggg cctattgcac caggccaaat gagagaacct 241 aggggaagtg atatagcagg aactactagt acccttcagg aacaaatagg atggatgaca 301 aacaatccac ctatcccagt aggagaaatc tataaaagat ggataatcct gggattaaat 361 aaaatagtaa gaatgtatag ccctgtcagc attttggaca taagacaagg accaaaggaa 421 ccctttagag actatgtaga tcggttctat aaaacttta // >>> Compressing and loading files (Blocked GNU zip format): ======================================================= >>> from Bio import SeqIO >>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank") >>> len(orchid_dict) 94 $ bgzip -c ls_orchid.gbk > ls_orchid.gbk.bgz >>> from Bio import SeqIO >>> orchid_dict = SeqIO.index("ls_orchid.gbk.bgz", "genbank") >>> len(orchid_dict) 94 >>> from Bio import SeqIO >>> orchid_dict = SeqIO.index_db("ls_orchid.gbk.bgz.idx", "ls_orchid.gbk.bgz", "genbank") >>> len(orchid_dict) 94 Writing multiple records: ========================= Say rec1, rec2, and rec3 are all SeqRecords then, my_records=[rec1,rec2,rec3] and use from Bio import SeqIO SeqIO.write(my_records, "my_example.faa", "fasta") To convert between different sequence formats from Bio import SeqIO records = SeqIO.parse("ls_orchid.gbk", "genbank") count = SeqIO.write(records, "my_example.fasta", "fasta") print("Converted %i records" % count) or from Bio import SeqIO count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta") print("Converted %i records" % count) Formatted string from SeqRecord: ================================ [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SeqIO >>> from StringIO import StringIO >>> records = SeqIO.parse("ls_orchid.gbk", "genbank") >>> out_handle = StringIO() >>> SeqIO.write(records, out_handle, "fasta") 94 >>> fasta_data = out_handle.getvalue() >>> fasta_data[0:200] '>Z78533.1 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA\nCGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT\nGACCCTGATTTGTTGTTG' >>> out_handle = open("ls_orchid_long.tab", "w") >>> for record in SeqIO.parse("ls_orchid.gbk", "genbank"): ... if len(record) > 100: ... out_handle.write(record.format("tab")) ... >>> out_handle.close() [uzi@quince-srv2 ~/biopython]$ head -2 ls_orchid_long.tab Z78533.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC Z78532.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC Bio.AlignIO: ============ Bio.AlignIO.parse() --> Will return an iterator which gives MultipleSeqAlignment objects Bio.AlignIO.read() --> Returns a single MultipleSeqAlignment object [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import AlignIO >>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm") >>> print(alignment) SingleLetterAlphabet() alignment with 7 rows and 52 columns AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73 >>> print("Alignment length %i" % alignment.get_alignment_length()) Alignment length 52 >>> for record in alignment: ... print("%s - %s" % (record.seq, record.id)) ... AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73 >>> for record in alignment: ... if record.dbxrefs: ... print("%s %s" % (record.id, record.dbxrefs)) ... COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;'] COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;'] Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;'] COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;'] >>> for record in alignment: ... print(record) ... ID: COATB_BPIKE/30-81 Name: COATB_BPIKE Description: COATB_BPIKE/30-81 Database cross-references: PDB; 1ifl ; 1-52; Number of features: 0 /start=30 /end=81 /accession=P03620.1 Per letter annotation for: secondary_structure Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA', SingleLetterAlphabet()) ID: Q9T0Q8_BPIKE/1-52 Name: Q9T0Q8_BPIKE Description: Q9T0Q8_BPIKE/1-52 Number of features: 0 /start=1 /end=52 /accession=Q9T0Q8.1 Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet()) ID: COATB_BPI22/32-83 Name: COATB_BPI22 Description: COATB_BPI22/32-83 Number of features: 0 /start=32 /end=83 /accession=P15416.1 Seq('DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA', SingleLetterAlphabet()) ID: COATB_BPM13/24-72 Name: COATB_BPM13 Description: COATB_BPM13/24-72 Database cross-references: PDB; 2cpb ; 1-49;, PDB; 2cps ; 1-49; Number of features: 0 /start=24 /end=72 /accession=P69541.1 Per letter annotation for: secondary_structure Seq('AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA', SingleLetterAlphabet()) ID: COATB_BPZJ2/1-49 Name: COATB_BPZJ2 Description: COATB_BPZJ2/1-49 Number of features: 0 /start=1 /end=49 /accession=P03618.1 Seq('AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA', SingleLetterAlphabet()) ID: Q9T0Q9_BPFD/1-49 Name: Q9T0Q9_BPFD Description: Q9T0Q9_BPFD/1-49 Database cross-references: PDB; 1nh4 A; 1-49; Number of features: 0 /start=1 /end=49 /accession=Q9T0Q9.1 Per letter annotation for: secondary_structure Seq('AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA', SingleLetterAlphabet()) ID: COATB_BPIF1/22-73 Name: COATB_BPIF1 Description: COATB_BPIF1/22-73 Database cross-references: PDB; 1ifk ; 1-50; Number of features: 0 /start=22 /end=73 /accession=P03619.2 Per letter annotation for: secondary_structure Seq('FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA', SingleLetterAlphabet()) >>> alignment = AlignIO.read("PF05371_seed.faa", "fasta") >>> print(alignment) SingleLetterAlphabet() alignment with 7 rows and 52 columns AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA CAPSD_BPM13/24-72 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA CAPSD_BPZJ2/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA CAPSD_BPIF1/22-73 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA CAPSD_BPI22/32-83 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA CAPSD_BPIKE/30-81 >>> Multiple pair-wise alignments: ============================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import AlignIO >>> for alignment in AlignIO.parse("multiplepairwise.fasta", "fasta", seq_count=2): ... print("Alignment length %i" % alignment.get_alignment_length()) ... for record in alignment: ... print("%s - %s" % (record.seq, record.id)) ... print("") ... Alignment length 19 ACTACGACTAGCTCAG--G - Alpha ACTACCGCTAGCTCAGAAG - XXX Alignment length 17 ACTACGACTAGCTCAGG - Alpha ACTACGGCAAGCACAGG - YYY Alignment length 21 --ACTACGAC--TAGCTCAGG - Alpha GGACTACGACAATAGCTCAGG - ZZZ >>> Writing multiple alignments: ============================ [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Alphabet import generic_dna >>> from Bio.Seq import Seq >>> from Bio.SeqRecord import SeqRecord >>> from Bio.Align import MultipleSeqAlignment >>> align1 = MultipleSeqAlignment([ ... SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"), ... SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"), ... SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"), ... ]) >>> align2 = MultipleSeqAlignment([ ... SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"), ... SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"), ... SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"), ... ]) >>> align3 = MultipleSeqAlignment([ ... SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"), ... SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"), ... SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"), ... ]) >>> my_alignments = [align1, align2, align3] >>> from Bio import AlignIO >>> AlignIO.write(my_alignments, "my_example.phy", "phylip") 3 >>> [uzi@quince-srv2 ~/biopython]$ cat my_example.phy 3 12 Alpha ACTGCTAGCT AG Beta ACT-CTAGCT AG Gamma ACTGCTAGDT AG 3 9 Delta GTCAGC-AG Epsilon GACAGCTAG Zeta GTCAGCTAG 3 13 Eta ACTAGTACAG CTG Theta ACTAGTACAG CT- Iota -CTACTACAG GTG [uzi@quince-srv2 ~/biopython]$ Converting between sequence alignment format: ============================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import AlignIO >>> alignments = AlignIO.parse("PF05371_seed.sth", "stockholm") >>> count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal") >>> print("Converted %i alignments" % count) Converted 1 alignments >>> count = AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal") >>> print("Converted %i alignments" % count) Converted 1 alignments >>> AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip") 1 >>> [uzi@quince-srv2 ~/biopython]$ cat PF05371_seed.aln CLUSTAL X (1.81) multiple sequence alignment COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS COATB_BPIKE/30-81 KA Q9T0Q8_BPIKE/1-52 RA COATB_BPI22/32-83 KA COATB_BPM13/24-72 KA COATB_BPZJ2/1-49 KA Q9T0Q9_BPFD/1-49 KA COATB_BPIF1/22-73 RA [uzi@quince-srv2 ~/biopython]$ cat PF05371_seed.phy 7 52 COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS KA RA KA KA KA KA RA [uzi@quince-srv2 ~/biopython]$ Multiple alignments as formatted strings: ========================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import AlignIO >>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm") >>> print(alignment.format("clustal")) CLUSTAL X (1.81) multiple sequence alignment COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS COATB_BPIKE/30-81 KA Q9T0Q8_BPIKE/1-52 RA COATB_BPI22/32-83 KA COATB_BPM13/24-72 KA COATB_BPZJ2/1-49 KA Q9T0Q9_BPFD/1-49 KA COATB_BPIF1/22-73 RA >>> from StringIO import StringIO >>> >>> alignments = AlignIO.parse("PF05371_seed.sth", "stockholm") >>> >>> out_handle = StringIO() >>> AlignIO.write(alignments, out_handle, "clustal") 1 >>> clustal_data = out_handle.getvalue() >>> Locally installed multiple alignment tools: =========================================== >>> import Bio.Align.Applications >>> dir(Bio.Align.Applications) ... ['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline', 'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ...] The module Bio.Emboss.Applications has wrappers for some of the EMBOSS suite, including needle and water Running locally installed clustalw2: ==================================== /home/opt/clustalw-2.1/src/clustalw2 Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> import os >>> from Bio.Align.Applications import ClustalwCommandline >>> clustalw_exec = "/home/opt/clustalw-2.1/src/clustalw2" >>> clustalw_cline = ClustalwCommandline(clustalw_exec, infile="opuntia.fasta") >>> assert os.path.isfile(clustalw_exec), "Clustal W executable missing" >>> stdout, stderr = clustalw_cline() >>> print(stdout) CLUSTAL 2.1 Multiple Sequence Alignments Sequence format is Pearson Sequence 1: gi|6273291|gb|AF191665.1|AF191665 902 bp Sequence 2: gi|6273290|gb|AF191664.1|AF191664 899 bp Sequence 3: gi|6273289|gb|AF191663.1|AF191663 899 bp Sequence 4: gi|6273287|gb|AF191661.1|AF191661 895 bp Sequence 5: gi|6273286|gb|AF191660.1|AF191660 893 bp Sequence 6: gi|6273285|gb|AF191659.1|AF191659 894 bp Sequence 7: gi|6273284|gb|AF191658.1|AF191658 896 bp Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 99 Sequences (1:3) Aligned. Score: 99 Sequences (1:4) Aligned. Score: 98 Sequences (1:5) Aligned. Score: 98 Sequences (1:6) Aligned. Score: 98 Sequences (1:7) Aligned. Score: 98 Sequences (2:3) Aligned. Score: 99 Sequences (2:4) Aligned. Score: 98 Sequences (2:5) Aligned. Score: 98 Sequences (2:6) Aligned. Score: 98 Sequences (2:7) Aligned. Score: 98 Sequences (3:4) Aligned. Score: 98 Sequences (3:5) Aligned. Score: 98 Sequences (3:6) Aligned. Score: 98 Sequences (3:7) Aligned. Score: 98 Sequences (4:5) Aligned. Score: 99 Sequences (4:6) Aligned. Score: 99 Sequences (4:7) Aligned. Score: 99 Sequences (5:6) Aligned. Score: 99 Sequences (5:7) Aligned. Score: 99 Sequences (6:7) Aligned. Score: 99 Guide tree file created: [opuntia.dnd] There are 6 groups Start of Multiple Alignment Aligning... Group 1: Sequences: 2 Score:16933 Group 2: Sequences: 2 Score:16703 Group 3: Sequences: 4 Score:16812 Group 4: Sequences: 2 Score:17071 Group 5: Sequences: 3 Score:16845 Group 6: Sequences: 7 Score:16678 Alignment Score 114256 CLUSTAL-Alignment file created [opuntia.aln] >>> from Bio import AlignIO >>> align = AlignIO.read("opuntia.aln", "clustal") >>> print(align) SingleLetterAlphabet() alignment with 7 rows and 906 columns TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191 >>> from Bio import Phylo >>> tree=Phylo.read("opuntia.dnd","newick") >>> Phylo.draw_ascii(tree) , gi|6273291|gb|AF191665.1|AF191665 | , gi|6273290|gb|AF191664.1|AF191664 | | gi|6273289|gb|AF191663.1|AF191663 | ___________________________________________, gi|6273287|gb|AF191661.1|AF191661 | | gi|6273286|gb|AF191660.1|AF191660 | , gi|6273285|gb|AF191659.1|AF191659 | | gi|6273284|gb|AF191658.1|AF191658 >>> [uzi@quince-srv2 ~/biopython]$ cat opuntia.dnd ( ( gi|6273291|gb|AF191665.1|AF191665:0.00418, ( gi|6273290|gb|AF191664.1|AF191664:0.00189, gi|6273289|gb|AF191663.1|AF191663:0.00145) :0.00083) :0.00770, ( gi|6273287|gb|AF191661.1|AF191661:0.00489, gi|6273286|gb|AF191660.1|AF191660:0.00295) :0.00014, ( gi|6273285|gb|AF191659.1|AF191659:0.00094, gi|6273284|gb|AF191658.1|AF191658:0.00018) :0.00125); [uzi@quince-srv2 ~/biopython]$ Running locally installed MUSCLE: ================================= /home/opt/qiime_software/muscle-3.8.31-release/muscle Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Align.Applications import MuscleCommandline >>> muscle_exec="/home/opt/qiime_software/muscle-3.8.31-release/muscle" >>> import os >>> assert os.path.isfile(muscle_exec), "muscle executable missing" >>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt") >>> print(cline) /home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt >>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt", clw=True) >>> print(cline) /home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt -clw >>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt", clwstrict=True) >>> print(cline) /home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt -clwstrict WHEN NOT SPECIFYING ANY OUTPUT >>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta") >>> stdout,stderr=cline() >>> from StringIO import StringIO >>> from Bio import AlignIO >>> align = AlignIO.read(StringIO(stdout),"fasta") >>> print(align) SingleLetterAlphabet() alignment with 7 rows and 906 columns TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658 >>> import subprocess >>> import sys >>> child=subprocess.Popen(str(cline),stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=(sys.platform!="win32")) >>> align=AlignIO.read(child.stdout,"fasta") >>> print(align) SingleLetterAlphabet() alignment with 7 rows and 906 columns TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658 WHEN NOT SPECIFYING ANY INPUT OR OUTPUT >>> from Bio import SeqIO >>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900) >>> cline=MuscleCommandline(clwstrict=True) >>> cline=MuscleCommandline(muscle_exec,clwstrict=True) >>> print(cline) /home/opt/qiime_software/muscle-3.8.31-release/muscle -clwstrict >>> child=subprocess.Popen(str(cline),stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=(sys.platform!="win32")) >>> SeqIO.write(records,child.stdin,"fasta") 6 >>> child.stdin.close() >>> align=AlignIO.read(child.stdout,"clustal") >>> print(align) SingleLetterAlphabet() alignment with 6 rows and 900 columns TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165 WHEN NOT SPECIFYING ANY INPUT OR OUTPUT (NOT USING SUBPROCESS) >>> handle=StringIO() >>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900) >>> SeqIO.write(records,handle,"fasta") 6 >>> data=handle.getvalue() >>> print(cline) /home/opt/qiime_software/muscle-3.8.31-release/muscle -clwstrict >>> stdout,stderr=cline(stdin=data) >>> align=AlignIO.read(StringIO(stdout),"clustal") >>> print(align) SingleLetterAlphabet() alignment with 6 rows and 900 columns TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166 TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166 TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165 >>> Running locally installed EMBOSS tools: ======================================= The EMBOSS suite includes the water and needle tools for Smith-Waterman algorithm local alignment, and Needleman-Wunsch global alignment. Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Emboss.Applications import NeedleCommandline >>> needle_cline=NeedleCommandline("needle",asequence="alpha.fasta",bsequence="beta.fasta",gapopen=10,gapextend=0.5,outfile="needle.txt") >>> print(needle_cline) needle -outfile=needle.txt -asequence=alpha.fasta -bsequence=beta.fasta -gapopen=10 -gapextend=0.5 >>> stdout,stderr=needle_cline() >>> print(stdout+stderr) Needleman-Wunsch global alignment of two sequences >>> from Bio import AlignIO >>> align=AlignIO.read("needle.txt","emboss") >>> print(align) SingleLetterAlphabet() alignment with 2 rows and 149 columns MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR HBA_HUMAN MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRF...KYH HBB_HUMAN >>> if you don’t want a temporary output file to get rid of, use stdout=True rather than the outfile argument and also to read one of the one of the inputs from stdin, use asequence="stdin" Running BLAST online: ===================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Blast import NCBIWWW >>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116") >>> fasta_string = open("m_cold.fasta").read() >>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string) >>> from Bio import SeqIO >>> record = SeqIO.read("m_cold.fasta", format="fasta") >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq) >>> record = SeqIO.read("m_cold.fasta", format="fasta") >>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta")) >>> save_file = open("my_blast.xml", "w") >>> save_file.write(result_handle.read()) >>> save_file.close() >>> result_handle.close() Note you can only read result_handle.read() once. Next time if you read it, it will return empty string Running BLAST locally: ====================== /home/opt/ncbi-blast-2.2.28+/bin/blastx /home/opt/ncbi-blast-2.2.28+/db/nr [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Blast.Applications import NcbiblastxCommandline >>> blastx_exec="/home/opt/ncbi-blast-2.2.28+/bin/blastx" >>> import os >>> assert os.path.isfile(blastx_exec), "blastx executable missing" >>> blastx_cline = NcbiblastxCommandline(blastx_exec,query="opuntia.fasta", db="/home/opt/ncbi-blast-2.2.28+/db/nr", evalue=0.001,outfmt=5, out="opuntia.xml") >>> blastx_cline NcbiblastxCommandline(cmd='/home/opt/ncbi-blast-2.2.28+/bin/blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta', db='/home/opt/ncbi-blast-2.2.28+/db/nr', evalue=0.001) >>> print(blastx_cline) /home/opt/ncbi-blast-2.2.28+/bin/blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db /home/opt/ncbi-blast-2.2.28+/db/nr -evalue 0.001 >>> stdout, stderr = blastx_cline() >>> Parsing BLAST records: ====================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio.Blast import NCBIXML >>> result_handle=open("my_blast.xml") >>> blast_record=NCBIXML.read(result_handle) >>> result_handle.close() >>> blast_record >>> result_handle=open("opuntia.xml") >>> blast_records=NCBIXML.parse(result_handle) >>> result_handle.close() >>> blast_records >>> E_VALUE_THRESH = 0.04 >>> for alignment in blast_record.alignments: ... for hsp in alignment.hsps: ... if hsp.expect < E_VALUE_THRESH: ... print('****Alignment****') ... print('sequence:'+ alignment.title) ... print('length:' + alignment.length) ... print('e value:' + hsp.expect) ... print(hsp.query[0:75] + '...') ... print(hsp.match[0:75] + '...') ... print(hsp.sbjct[0:75] + '...') ****Alignment**** sequence:gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA length:878 e value:3.4368500000000001e-99 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ****Alignment**** sequence:gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X4, mRNA length:911 e value:3.4368500000000001e-99 AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA... ||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ... AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT... ... Experimental Bio.SearchIO: ========================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> blast_qresult=SearchIO.read("my_blast.xml","blast-xml") >>> print(blast_qresult) Program: blastn (2.2.29+) Query: 40259 (1111) gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cryst... Target: nt Hits: ---- ----- ---------------------------------------------------------- # # HSP ID + description ---- ----- ---------------------------------------------------------- 0 1 gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sin... 1 1 gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sin... 2 1 gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sin... 3 1 gi|568824601|ref|XM_006466623.1| PREDICTED: Citrus sin... 4 1 gi|568824599|ref|XM_006466622.1| PREDICTED: Citrus sin... 5 1 gi|567866318|ref|XM_006425719.1| Citrus clementina hyp... 6 1 gi|567866316|ref|XM_006425718.1| Citrus clementina hyp... 7 1 gi|567866314|ref|XM_006425717.1| Citrus clementina hyp... 8 1 gi|567866312|ref|XM_006425716.1| Citrus clementina hyp... 9 1 gi|224094601|ref|XM_002310151.1| Populus trichocarpa p... 10 1 gi|566180892|ref|XM_006380679.1| Populus trichocarpa c... 11 1 gi|470129329|ref|XM_004300526.1| PREDICTED: Fragaria v... 12 1 gi|359495761|ref|XM_002274845.2| PREDICTED: Vitis vini... 13 1 gi|349709091|emb|FQ378501.1| Vitis vinifera clone SS0A... 14 1 gi|255562758|ref|XM_002522339.1| Ricinus communis COR4... 15 1 gi|565355903|ref|XM_006344753.1| PREDICTED: Solanum tu... 16 1 gi|502153055|ref|XM_004509143.1| PREDICTED: Cicer arie... 17 1 gi|358346403|ref|XM_003637210.1| Medicago truncatula C... 18 1 gi|358344000|ref|XM_003636035.1| Medicago truncatula C... 19 1 gi|571531434|ref|XM_003548859.2| PREDICTED: Glycine ma... 20 1 gi|449448541|ref|XM_004141977.1| PREDICTED: Cucumis sa... 21 1 gi|571438949|ref|XM_003519866.2| PREDICTED: Glycine ma... 22 1 gi|460375241|ref|XM_004233368.1| PREDICTED: Solanum ly... 23 1 gi|225311746|dbj|AK326681.1| Solanum lycopersicum cDNA... 24 1 gi|255762732|gb|GQ370517.1| Salvia miltiorrhiza cold a... 25 1 gi|225428595|ref|XM_002284686.1| PREDICTED: Vitis vini... 26 1 gi|568851338|ref|XM_006479288.1| PREDICTED: Citrus sin... 27 1 gi|567861017|ref|XM_006423100.1| Citrus clementina hyp... 28 1 gi|297819785|ref|XM_002877730.1| Arabidopsis lyrata su... 29 1 gi|86755971|gb|DQ359747.1| Chimonanthus praecox cold a... ~~~ 47 1 gi|242389633|emb|FP100664.1| Phyllostachys edulis cDNA... 48 1 gi|242382816|emb|FP092058.1| Phyllostachys edulis cDNA... 49 1 gi|573926239|ref|XM_006650591.1| PREDICTED: Oryza brac... >>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl') >>> print(blat_qresult) Program: blat () Query: mystery_seq (61) Target: Hits: ---- ----- ---------------------------------------------------------- # # HSP ID + description ---- ----- ---------------------------------------------------------- 0 17 chr19 >>> dir(blast_qresult) ['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version'] >>> for hit in blast_qresult: ... hit ... Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps) Hit(id='gi|568824605|ref|XM_006466625.1|', query_id='40259', 1 hsps) Hit(id='gi|568824603|ref|XM_006466624.1|', query_id='40259', 1 hsps) Hit(id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 hsps) Hit(id='gi|568824599|ref|XM_006466622.1|', query_id='40259', 1 hsps) Hit(id='gi|567866318|ref|XM_006425719.1|', query_id='40259', 1 hsps) Hit(id='gi|567866316|ref|XM_006425718.1|', query_id='40259', 1 hsps) Hit(id='gi|567866314|ref|XM_006425717.1|', query_id='40259', 1 hsps) Hit(id='gi|567866312|ref|XM_006425716.1|', query_id='40259', 1 hsps) Hit(id='gi|224094601|ref|XM_002310151.1|', query_id='40259', 1 hsps) Hit(id='gi|566180892|ref|XM_006380679.1|', query_id='40259', 1 hsps) Hit(id='gi|470129329|ref|XM_004300526.1|', query_id='40259', 1 hsps) Hit(id='gi|359495761|ref|XM_002274845.2|', query_id='40259', 1 hsps) Hit(id='gi|349709091|emb|FQ378501.1|', query_id='40259', 1 hsps) Hit(id='gi|255562758|ref|XM_002522339.1|', query_id='40259', 1 hsps) Hit(id='gi|565355903|ref|XM_006344753.1|', query_id='40259', 1 hsps) Hit(id='gi|502153055|ref|XM_004509143.1|', query_id='40259', 1 hsps) Hit(id='gi|358346403|ref|XM_003637210.1|', query_id='40259', 1 hsps) Hit(id='gi|358344000|ref|XM_003636035.1|', query_id='40259', 1 hsps) Hit(id='gi|571531434|ref|XM_003548859.2|', query_id='40259', 1 hsps) Hit(id='gi|449448541|ref|XM_004141977.1|', query_id='40259', 1 hsps) Hit(id='gi|571438949|ref|XM_003519866.2|', query_id='40259', 1 hsps) Hit(id='gi|460375241|ref|XM_004233368.1|', query_id='40259', 1 hsps) Hit(id='gi|225311746|dbj|AK326681.1|', query_id='40259', 1 hsps) Hit(id='gi|255762732|gb|GQ370517.1|', query_id='40259', 1 hsps) Hit(id='gi|225428595|ref|XM_002284686.1|', query_id='40259', 1 hsps) Hit(id='gi|568851338|ref|XM_006479288.1|', query_id='40259', 1 hsps) Hit(id='gi|567861017|ref|XM_006423100.1|', query_id='40259', 1 hsps) Hit(id='gi|297819785|ref|XM_002877730.1|', query_id='40259', 1 hsps) Hit(id='gi|86755971|gb|DQ359747.1|', query_id='40259', 1 hsps) Hit(id='gi|565467969|ref|XM_006291803.1|', query_id='40259', 1 hsps) Hit(id='gi|145339339|ref|NM_114943.4|', query_id='40259', 1 hsps) Hit(id='gi|15810634|gb|AY056356.1|', query_id='40259', 1 hsps) Hit(id='gi|10121842|gb|AF283005.1|', query_id='40259', 1 hsps) Hit(id='gi|13430785|gb|AF360305.1|', query_id='40259', 1 hsps) Hit(id='gi|60317457|gb|AY761065.1|', query_id='40259', 1 hsps) Hit(id='gi|255556172|ref|XM_002519075.1|', query_id='40259', 1 hsps) Hit(id='gi|156567558|gb|EU077497.1|', query_id='40259', 1 hsps) Hit(id='gi|567188085|ref|XM_006403931.1|', query_id='40259', 1 hsps) Hit(id='gi|449502385|ref|XM_004161578.1|', query_id='40259', 1 hsps) Hit(id='gi|449455211|ref|XM_004145299.1|', query_id='40259', 1 hsps) Hit(id='gi|449474965|ref|XM_004154286.1|', query_id='40259', 1 hsps) Hit(id='gi|46577795|gb|AY587773.1|', query_id='40259', 1 hsps) Hit(id='gi|305690597|gb|HQ010041.1|', query_id='40259', 1 hsps) Hit(id='gi|566187445|ref|XM_002313788.2|', query_id='40259', 1 hsps) Hit(id='gi|470148635|ref|XM_004309813.1|', query_id='40259', 1 hsps) Hit(id='gi|357528757|gb|GU370376.1|', query_id='40259', 1 hsps) Hit(id='gi|242389633|emb|FP100664.1|', query_id='40259', 1 hsps) Hit(id='gi|242382816|emb|FP092058.1|', query_id='40259', 1 hsps) Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps) >>> len(blast_qresult) 50 >>> blast_qresult[0] Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps) >>> blast_qresult[-1] Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps) >>> blast_slice=blast_qresult[:3] >>> print(blast_slice) Program: blastn (2.2.29+) Query: 40259 (1111) gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cryst... Target: nt Hits: ---- ----- ---------------------------------------------------------- # # HSP ID + description ---- ----- ---------------------------------------------------------- 0 1 gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sin... 1 1 gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sin... 2 1 gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sin... >>> blast_qresult['gi|568824607|ref|XM_006466626.1|'] Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps) >>> blast_qresult.hits [Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps), Hit(id='gi|568824605|ref|XM_006466625.1|', query_id='40259', 1 hsps), Hit(id='gi|568824603|ref|XM_006466624.1|', query_id='40259', 1 hsps), Hit(id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 hsps), Hit(id='gi|568824599|ref|XM_006466622.1|', query_id='40259', 1 hsps), Hit(id='gi|567866318|ref|XM_006425719.1|', query_id='40259', 1 hsps), Hit(id='gi|567866316|ref|XM_006425718.1|', query_id='40259', 1 hsps), Hit(id='gi|567866314|ref|XM_006425717.1|', query_id='40259', 1 hsps), Hit(id='gi|567866312|ref|XM_006425716.1|', query_id='40259', 1 hsps), Hit(id='gi|224094601|ref|XM_002310151.1|', query_id='40259', 1 hsps), Hit(id='gi|566180892|ref|XM_006380679.1|', query_id='40259', 1 hsps), Hit(id='gi|470129329|ref|XM_004300526.1|', query_id='40259', 1 hsps), Hit(id='gi|359495761|ref|XM_002274845.2|', query_id='40259', 1 hsps), Hit(id='gi|349709091|emb|FQ378501.1|', query_id='40259', 1 hsps), Hit(id='gi|255562758|ref|XM_002522339.1|', query_id='40259', 1 hsps), Hit(id='gi|565355903|ref|XM_006344753.1|', query_id='40259', 1 hsps), Hit(id='gi|502153055|ref|XM_004509143.1|', query_id='40259', 1 hsps), Hit(id='gi|358346403|ref|XM_003637210.1|', query_id='40259', 1 hsps), Hit(id='gi|358344000|ref|XM_003636035.1|', query_id='40259', 1 hsps), Hit(id='gi|571531434|ref|XM_003548859.2|', query_id='40259', 1 hsps), Hit(id='gi|449448541|ref|XM_004141977.1|', query_id='40259', 1 hsps), Hit(id='gi|571438949|ref|XM_003519866.2|', query_id='40259', 1 hsps), Hit(id='gi|460375241|ref|XM_004233368.1|', query_id='40259', 1 hsps), Hit(id='gi|225311746|dbj|AK326681.1|', query_id='40259', 1 hsps), Hit(id='gi|255762732|gb|GQ370517.1|', query_id='40259', 1 hsps), Hit(id='gi|225428595|ref|XM_002284686.1|', query_id='40259', 1 hsps), Hit(id='gi|568851338|ref|XM_006479288.1|', query_id='40259', 1 hsps), Hit(id='gi|567861017|ref|XM_006423100.1|', query_id='40259', 1 hsps), Hit(id='gi|297819785|ref|XM_002877730.1|', query_id='40259', 1 hsps), Hit(id='gi|86755971|gb|DQ359747.1|', query_id='40259', 1 hsps), Hit(id='gi|565467969|ref|XM_006291803.1|', query_id='40259', 1 hsps), Hit(id='gi|145339339|ref|NM_114943.4|', query_id='40259', 1 hsps), Hit(id='gi|15810634|gb|AY056356.1|', query_id='40259', 1 hsps), Hit(id='gi|10121842|gb|AF283005.1|', query_id='40259', 1 hsps), Hit(id='gi|13430785|gb|AF360305.1|', query_id='40259', 1 hsps), Hit(id='gi|60317457|gb|AY761065.1|', query_id='40259', 1 hsps), Hit(id='gi|255556172|ref|XM_002519075.1|', query_id='40259', 1 hsps), Hit(id='gi|156567558|gb|EU077497.1|', query_id='40259', 1 hsps), Hit(id='gi|567188085|ref|XM_006403931.1|', query_id='40259', 1 hsps), Hit(id='gi|449502385|ref|XM_004161578.1|', query_id='40259', 1 hsps), Hit(id='gi|449455211|ref|XM_004145299.1|', query_id='40259', 1 hsps), Hit(id='gi|449474965|ref|XM_004154286.1|', query_id='40259', 1 hsps), Hit(id='gi|46577795|gb|AY587773.1|', query_id='40259', 1 hsps), Hit(id='gi|305690597|gb|HQ010041.1|', query_id='40259', 1 hsps), Hit(id='gi|566187445|ref|XM_002313788.2|', query_id='40259', 1 hsps), Hit(id='gi|470148635|ref|XM_004309813.1|', query_id='40259', 1 hsps), Hit(id='gi|357528757|gb|GU370376.1|', query_id='40259', 1 hsps), Hit(id='gi|242389633|emb|FP100664.1|', query_id='40259', 1 hsps), Hit(id='gi|242382816|emb|FP092058.1|', query_id='40259', 1 hsps), Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps)] >>> blast_qresult.hit_keys ['gi|568824607|ref|XM_006466626.1|', 'gi|568824605|ref|XM_006466625.1|', 'gi|568824603|ref|XM_006466624.1|', 'gi|568824601|ref|XM_006466623.1|', 'gi|568824599|ref|XM_006466622.1|', 'gi|567866318|ref|XM_006425719.1|', 'gi|567866316|ref|XM_006425718.1|', 'gi|567866314|ref|XM_006425717.1|', 'gi|567866312|ref|XM_006425716.1|', 'gi|224094601|ref|XM_002310151.1|', 'gi|566180892|ref|XM_006380679.1|', 'gi|470129329|ref|XM_004300526.1|', 'gi|359495761|ref|XM_002274845.2|', 'gi|349709091|emb|FQ378501.1|', 'gi|255562758|ref|XM_002522339.1|', 'gi|565355903|ref|XM_006344753.1|', 'gi|502153055|ref|XM_004509143.1|', 'gi|358346403|ref|XM_003637210.1|', 'gi|358344000|ref|XM_003636035.1|', 'gi|571531434|ref|XM_003548859.2|', 'gi|449448541|ref|XM_004141977.1|', 'gi|571438949|ref|XM_003519866.2|', 'gi|460375241|ref|XM_004233368.1|', 'gi|225311746|dbj|AK326681.1|', 'gi|255762732|gb|GQ370517.1|', 'gi|225428595|ref|XM_002284686.1|', 'gi|568851338|ref|XM_006479288.1|', 'gi|567861017|ref|XM_006423100.1|', 'gi|297819785|ref|XM_002877730.1|', 'gi|86755971|gb|DQ359747.1|', 'gi|565467969|ref|XM_006291803.1|', 'gi|145339339|ref|NM_114943.4|', 'gi|15810634|gb|AY056356.1|', 'gi|10121842|gb|AF283005.1|', 'gi|13430785|gb|AF360305.1|', 'gi|60317457|gb|AY761065.1|', 'gi|255556172|ref|XM_002519075.1|', 'gi|156567558|gb|EU077497.1|', 'gi|567188085|ref|XM_006403931.1|', 'gi|449502385|ref|XM_004161578.1|', 'gi|449455211|ref|XM_004145299.1|', 'gi|449474965|ref|XM_004154286.1|', 'gi|46577795|gb|AY587773.1|', 'gi|305690597|gb|HQ010041.1|', 'gi|566187445|ref|XM_002313788.2|', 'gi|470148635|ref|XM_004309813.1|', 'gi|357528757|gb|GU370376.1|', 'gi|242389633|emb|FP100664.1|', 'gi|242382816|emb|FP092058.1|', 'gi|573926239|ref|XM_006650591.1|'] >>> 'gi|565467969|ref|XM_006291803.1|' in blast_qresult True >>> 'gi|565467969|ref|XM_006291803.1|' in blast_qresult True >>> for hit in blast_qresult[:5]: ... print("%s %i" % (hit.id, hit.seq_len)) ... gi|568824607|ref|XM_006466626.1| 878 gi|568824605|ref|XM_006466625.1| 911 gi|568824603|ref|XM_006466624.1| 894 gi|568824601|ref|XM_006466623.1| 922 gi|568824599|ref|XM_006466622.1| 966 >>> sort_key = lambda hit: hit.seq_len >>> sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False) >>> for hit in sorted_qresult[:5]: ... print("%s %i" % (hit.id, hit.seq_len))... gi|567866316|ref|XM_006425718.1| 1202 gi|566180892|ref|XM_006380679.1| 1182 gi|46577795|gb|AY587773.1| 1176 gi|567188085|ref|XM_006403931.1| 1171 gi|565355903|ref|XM_006344753.1| 1153 >>> The advantage of having the in_place flag here is that we’re preserving the native ordering, so we may use it again later. You should note that this is not the default behavior of QueryResult.sort, however, which is why we needed to set the in_place flag to True explicitly. >>> filter_func = lambda hit: len(hit.hsps) > 1 >>> filtered_qresult = blast_qresult.hit_filter(filter_func) >>> len(filtered_qresult) 0 >>> blast_qresult QueryResult(id='40259', 50 hits) >>> def map_func(hit): ... hit.id = hit.id.split('|')[3] # renames 'gi|301171322|ref|NR_035857.1|' to 'NR_035857.1' ... return hit ... >>> mapped_qresult = blast_qresult.hit_map(map_func) >>> for hit in mapped_qresult[:5]: ... print(hit.id) ... XM_006466626.1 XM_006466625.1 XM_006466624.1 XM_006466623.1 XM_006466622.1 >>> Bio.SearchIO Hit: ================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml') >>> blast_hit = blast_qresult[3] >>> print(blast_hit) Query: 40259 gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum crystal... Hit: gi|568824601|ref|XM_006466623.1| (922) PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein... HSPs: ---- -------- --------- ------ --------------- --------------------- # E-value Bit score Span Query range Hit range ---- -------- --------- ------ --------------- --------------------- 0 3.4e-99 372.78 596 [63:655] [136:725] >>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl') >>> blat_hit = blat_qresult[0] >>> print(blat_hit) Query: mystery_seq Hit: chr19 (59128983) HSPs: ---- -------- --------- ------ --------------- --------------------- # E-value Bit score Span Query range Hit range ---- -------- --------- ------ --------------- --------------------- 0 ? ? ? [0:61] [54204480:54204541] 1 ? ? ? [0:61] [54233104:54264463] 2 ? ? ? [0:61] [54254477:54260071] 3 ? ? ? [1:61] [54210720:54210780] 4 ? ? ? [0:60] [54198476:54198536] 5 ? ? ? [0:61] [54265610:54265671] 6 ? ? ? [0:61] [54238143:54240175] 7 ? ? ? [0:60] [54189735:54189795] 8 ? ? ? [0:61] [54185425:54185486] 9 ? ? ? [0:60] [54197657:54197717] 10 ? ? ? [0:61] [54255662:54255723] 11 ? ? ? [0:61] [54201651:54201712] 12 ? ? ? [8:60] [54206009:54206061] 13 ? ? ? [10:61] [54178987:54179038] 14 ? ? ? [8:61] [54212018:54212071] 15 ? ? ? [8:51] [54234278:54234321] 16 ? ? ? [8:61] [54238143:54238196] >>> for hsp in blast_hit: ... hsp ... HSP(hit_id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 fragments) >>> blat_hit[0] HSP(hit_id='chr19', query_id='mystery_seq', 1 fragments) >>> sliced_hit=blat_hit[4:9] >>> print(sliced_hit) Query: mystery_seq Hit: chr19 (59128983) HSPs: ---- -------- --------- ------ --------------- --------------------- # E-value Bit score Span Query range Hit range ---- -------- --------- ------ --------------- --------------------- 0 ? ? ? [0:60] [54198476:54198536] 1 ? ? ? [0:61] [54265610:54265671] 2 ? ? ? [0:61] [54238143:54240175] 3 ? ? ? [0:60] [54189735:54189795] 4 ? ? ? [0:61] [54185425:54185486] >>> Bio.SearchIO HSP: ================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml') >>> blast_hsp = blast_qresult[0][0] # first hit, first hsp >>> print(blast_hsp) Query: 40259 gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthe... Hit: gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold... Query range: [63:655] (1) Hit range: [92:681] (1) Quick stats: evalue 3.4e-99; bitscore 372.78 Fragments: 1 (596 columns) Query - AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAAT~~~GTCTG ||||||||||| |||| || ||||| |||||||||||| | || | || | ~~~||||| Hit - AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAG~~~GTCTG >>> blast_hsp.query_range (63, 655) >>> blast_hsp.evalue 3.4368500000000001e-99 >>> blast_hsp.hit_start 92 >>> blast_hsp.query_span 592 >>> blast_hsp.aln_span 596 >>> blast_hsp.gap_num 11 >>> blast_hsp.ident_num 443 >>> blast_hsp.__dict__.keys() ['bitscore', 'evalue', 'ident_num', 'gap_num', 'bitscore_raw', 'pos_num', '_items'] >>> blast_hsp.query SeqRecord(seq=Seq('AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGG...CTG', DNAAlphabet()), id='40259', name='aligned query sequence', description="gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acclimation protein, mRNA sequence", dbxrefs=[]) >>> blast_hsp.hit SeqRecord(seq=Seq('AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAG...CTG', DNAAlphabet()), id='gi|568824607|ref|XM_006466626.1|', name='aligned hit sequence', description='PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA', dbxrefs=[]) >>> print(blast_hsp.aln) DNAAlphabet() alignment with 2 rows and 596 columns AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAA...CTG 40259 AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGAT...CTG gi|568824607|ref|XM_006466626.1| >>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl') >>> blat_hsp = blat_qresult[0][0] # first hit, first hsp >>> print(blat_hsp) Query: mystery_seq Hit: chr19 Query range: [0:61] (1) Hit range: [54204480:54204541] (1) Quick stats: evalue ?; bitscore ? Fragments: 1 (? columns) >>> blat_hsp.hit >>> blat_hsp.hit is None True >>> blat_hsp.query is None True >>> blat_hsp.aln is None True >>> blat_hsp.query_span 61 >>> blat_hsp.hit_span 61 >>> blat_hsp.score 61 >>> blat_hsp.mismatch_num 0 >>> blat_hsp2=blat_qresult[0][1] >>> print(blat_hsp2) Query: mystery_seq Hit: chr19 Query range: [0:61] (1) Hit range: [54233104:54264463] (1) Quick stats: evalue ?; bitscore ? Fragments: --- -------------- ---------------------- ---------------------- # Span Query range Hit range --- -------------- ---------------------- ---------------------- 0 ? [0:18] [54233104:54233122] 1 ? [18:61] [54264420:54264463] >>> blat_hsp2.hit_range (54233104, 54264463) >>> blat_hsp2.hit_range_all [(54233104, 54233122), (54264420, 54264463)] >>> blat_hsp2.hit_span 31359 >>> blat_hsp2.hit_span_all [18, 43] >>> blat_hsp2.hit_inter_ranges [(54233122, 54264420)] >>> blat_hsp2.hit_inter_spans [31298] >>> blat_hsp2.is_fragmented True >>> blat_hsp.is_fragmented False >>> Bio.SearchIO HSPFragment: ========================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> blast_qresult=SearchIO.read('my_blast.xml','blast-xml') >>> blast_frag = blast_qresult[0][0][0] # first hit, first hsp, first fragment >>> print(blast_frag) Query: 40259 gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthe... Hit: gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold... Query range: [63:655] (1) Hit range: [92:681] (1) Fragments: 1 (596 columns) Query - AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAAT~~~GTCTG ||||||||||| |||| || ||||| |||||||||||| | || | || | ~~~||||| Hit - AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAG~~~GTCTG >>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl') >>> blat_frag = blat_qresult[0][0][0] # first hit, first hsp, first fragment >>> print(blat_frag) Query: mystery_seq Hit: chr19 Query range: [0:61] (1) Hit range: [54204480:54204541] (1) Fragments: 1 (? columns) >>> blast_frag.query_start 63 >>> blast_frag.hit_strand 1 >>> blast_frag.hit SeqRecord(seq=Seq('AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAG...CTG', DNAAlphabet()), id='gi|568824607|ref|XM_006466626.1|', name='aligned hit sequence', description='PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA', dbxrefs=[]) >>> Bio.SearchIO tab-delimited files: ================================= [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> qresult = SearchIO.read('tab_2226_tblastn_003.txt', 'blast-tab') >>> qresult QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) >>> qresult2 = SearchIO.read('tab_2226_tblastn_007.txt', 'blast-tab', comments=True) >>> qresult2 QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) >>> for qresult in qresult: ... print(qresult.id) ... gi|145479850|ref|XM_001425911.1| gi|72012412|ref|XM_777959.1| gi|115975252|ref|XM_001180111.1| >>> One can similarly use SearchIO.index and SearchIO.index_db (storing in sqlite3 database) [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from Bio import SearchIO /usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release. BiopythonExperimentalWarning) >>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab') >>> sorted(idx.keys()) ['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|'] >>> idx['gi|16080617|ref|NP_391444.1|'] QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) >>> idx = SearchIO.index('tab_2226_tblastn_005.txt', 'blast-tab', comments=True) >>> sorted(idx.keys()) ['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|', 'random_s00'] >>> idx['gi|16080617|ref|NP_391444.1|'] QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) >>> key_function = lambda id: id.upper() # capitalizes the keys >>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab', key_function=key_function) >>> sorted(idx.keys()) ['GI|11464971:4-101', 'GI|16080617|REF|NP_391444.1|'] >>> idx['GI|16080617|REF|NP_391444.1|'] QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits) >>> qresults=SearchIO.parse('mirna.xml','blast-xml') >>> SearchIO.write(qresults,'results.tab','blast-tab') (3, 239, 277, 277) >>> SearchIO.convert('mirna.xml', 'blast-xml', 'results.tab', 'blast-tab') (3, 239, 277, 277) GenomeDiagram package: ====================== [uzi@quince-srv2 ~/biopython]$ python Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22) [GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2 Type "help", "copyright", "credits" or "license" for more information. >>> from reportlab.lib import colors >>> from reportlab.lib.units import cm >>> from Bio.Graphics import GenomeDiagram >>> from Bio import SeqIO >>> record=SeqIO.read("NC_005816.gb","genbank") >>> gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1") >>> gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")>>> gd_feature_set = gd_track_for_features.new_set() >>> for feature in record.features: ... if feature.type != "gene": ... #Exclude this feature ... continue ... if len(gd_feature_set) % 2 == 0: ... color = colors.blue ... else: ... color = colors.lightblue ... gd_feature_set.add_feature(feature,color=color,label=True) ... >>> gd_diagram.draw(format="linear",orientation="landscape",pagesize='A4',fragments=4,start=0,end=len(record)) >>> gd_diagram.write("plasmid_linear.pdf","PDF") >>> gd_diagram.write("plasmid_linear.eps","EPS") >>> gd_diagram.write("plasmid_linear.svg","SVG") >>> gd_diagram.write("plasmid_linear.png","PNG") >>> gd_diagram.draw(format="circular",circular=True,pagesize=(20*cm,20*cm), start=0,end=len(record),circle_core=0.7) >>> gd_diagram.write("plasmid_circular.pdf","PDF")