==============================================
Umer Zeeshan Ijaz
http://userweb.eng.gla.ac.uk/umer.ijaz
Biopython Tutorial + File Formats
Version 0.1
==============================================
CLUSTALW file format example:
=============================
CLUSTAL W (1.81) multiple sequence alignment
CYS1_DICDI -----MKVILLFVLAVFTVFVSS---------------RGIPPEEQ------------SQ
ALEU_HORVU MAHARVLLLALAVLATAAVAVASSSSFADSNPIRPVTDRAASTLESAVLGALGRTRHALR
CATH_HUMAN ------MWATLPLLCAGAWLLGV--------PVCGAAELSVNSLEK------------FH
* :*.. : :. . . *. .
CYS1_DICDI FLEFQDKFNKKY-SHEEYLERFEIFKSNLGKIEELNLIAINHKADTKFGVNKFADLSSDE
ALEU_HORVU FARFAVRYGKSYESAAEVRRRFRIFSESLEEVRSTN----RKGLPYRLGINRFSDMSWEE
CATH_HUMAN FKSWMSKHRKTY-STEEYHHRLQTFASNWRKINAHN----NGNHTFKMALNQFSDMSFAE
* : .. *.* * * *: * .. :: * . .:.:*.*:*:* *
CYS1_DICDI FKNYYLNNKEAIFTDDLPVADYLDDEFINSIPTAFDWRTRG-AVTPVKNQGQCGSCWSFS
ALEU_HORVU FQATRL-GAAQTCSATLAGNHLMRDA--AALPETKDWREDG-IVSPVKNQAHCGSCWTFS
CATH_HUMAN IKHKYLWSEPQNCSAT--KSNYLRGT--GPYPPSVDWRKKGNFVSPVKNQGACGSCWTFS
:: * . : : . . * : *** * *:*****. *****:**
CYS1_DICDI TTGNVEGQHFISQNKLVSLSEQNLVDCDHECMEYEGEEACDEGCNGGLQPNAYNYIIKNG
ALEU_HORVU TTGALEAAYTQATGKNISLSEQQLVDCAGGFNNF--------GCNGGLPSQAFEYIKYNG
CATH_HUMAN TTGALESAIAIATGKMLSLAEQQLVDCAQDFNNY--------GCQGGLPSQAFEYILYNK
*** :*. : .* :**:**:**** :: **:*** .:*::** *
CYS1_DICDI GIQTESSYPYTAETGTQCNFNSANIGAKISNFTMIP-KNETVMAGYIVSTGPLAIAADAV
ALEU_HORVU GIDTEESYPYKGVNGV-CHYKAENAAVQVLDSVNITLNAEDELKNAVGLVRPVSVAFQVI
CATH_HUMAN GIMGEDTYPYQGKDGY-CKFQPGKAIGFVKDVANITIYDEEAMVEAVALYNPVSFAFEVT
** *.:*** . * *:::. : : : . *. * : : *::.* :.
CYS1_DICDI E-WQFYIGGVF-DIPCN--PNSLDHGILIVGYSAKNTIFRKNMPYWIVKNSWGADWGEQG
ALEU_HORVU DGFRQYKSGVYTSDHCGTTPDDVNHAVLAVGYGVENGV-----PYWLIKNSWGADWGDNG
CATH_HUMAN QDFMMYRTGIYSSTSCHKTPDKVNHAVLAVGYGEKNGI-----PYWIVKNSWGPQWGMNG
: : * *:: . * *:.::*.:* ***. :* : ***::*****.:** :*
CYS1_DICDI YIYLRRGKNTCGVSNFVSTSII--
ALEU_HORVU YFKMEMGKNMCAIATCASYPVVAA
CATH_HUMAN YFLIERGKNMCGLAACASYPIPLV
*: : *** *.:: .* .:
Stolkholm file format:
======================
# STOCKHOLM 1.0
#=GF ID CBS
#=GF AC PF00571
#=GF DE CBS domain
#=GF AU Bateman A
#=GF CC CBS domains are small intracellular modules mostly found
#=GF CC in 2 or four copies within a protein.
#=GF SQ 5
#=GS O31698/18-71 AC O31698
#=GS O83071/192-246 AC O83071
#=GS O83071/259-312 AC O83071
#=GS O31698/88-139 AC O31698
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246 MTCRAQLIAVPRASSLAEAIACAQKMRVSRVPVYERS
#=GR O83071/192-246 SA 9998877564535242525515252536463774777
O83071/259-312 MQHVSAPVFVFECTRLAYVQHKLRAHSRAVAIVLDEY
#=GR O83071/259-312 SS CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEE
O31698/18-71 MIEADKVAHVQVGNNLEHALLVLTKTGYTAIPVLDPS
#=GR O31698/18-71 SS CCCHHHHHHHHHHHHHHHEEEEEEEEEEEEEEEEHHH
O31698/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31698/88-139 SS CCCCCCCHHHHHHHHHHHHEEEEEEEEEEEEEEEEEH
#=GC SS_cons CCCCCHHHHHHHHHHHHHEEEEEEEEEEEEEEEEEEH
O31699/88-139 EVMLTDIPRLHINDPIMKGFGMVINN..GFVCVENDE
#=GR O31699/88-139 AS ________________*____________________
#=GR O31699/88-139 IN ____________1____________2______0____
//
Blast XML file format example:
==============================
blastn
BLASTN 2.2.12 [Aug-07-2005]
Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
nr
gi|1348916|gb|G26684.1|G26684
human STS STS_D11570, sequence tagged site
285
10
1
-3
5
2
1
gi|1348916|gb|G26684.1|G26684
human STS STS_D11570, sequence tagged site
285
1
gi|9950606|gb|AE004854.1|
Pseudomonas aeruginosa PAO1, section 415 of 529 of the complete genome
AE004854
11884
1
38.1576
19
1.0598
68
86
6012
6030
1
1
19
19
0
19
CAGGCCAGCGACTTCTGGG
CAGGCCAGCGACTTCTGGG
|||||||||||||||||||
2
gi|15073988|emb|AL591786.1|SME591786
Sinorhizobium meliloti 1021 complete chromosome; segment 5/12
AL591786
299350
1
36.1753
18
4.18768
204
224
83648
83628
1
-1
20
20
0
21
TGAAAGGAAATNAAAATGGAA
TGAAAGGAAATCAAAATGGAA
||||||||||| |||||||||
371021
1233631384
0
0
0.710603
1.37406
1.30725
EMBL flat file format:
======================
ID SC10H5 standard; DNA; PRO; 4870 BP.
XX
AC AL031232;
XX
DE Streptomyces coelicolor cosmid 10H5.
XX
KW integral membrane protein.
XX
OS Streptomyces coelicolor
OC Eubacteria; Firmicutes; Actinomycetes; Streptomycetes;
OC Streptomycetaceae; Streptomyces.
XX
RN [1]
RP 1-4870
RA Oliver K., Harris D.;
RT ;
RL Unpublished.
XX
RN [2]
RP 1-4870
RA Parkhill J., Barrell B.G., Rajandream M.A.;
RT ;
RL Submitted (10-AUG-1998) to the EMBL/GenBank/DDBJ databases.
RL Streptomyces coelicolor sequencing project,
RL Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA
RL E-mail: barrell@sanger.ac.uk
RL Cosmids supplied by Prof. David A. Hopwood, [3]
RL John Innes Centre, Norwich Research Park, Colney,
RL Norwich, Norfolk NR4 7UH, UK.
XX
RN [3]
RP 1-4870
RA Redenbach M., Kieser H.M., Denapaite D., Eichner A.,
RA Cullum J., Kinashi H., Hopwood D.A.;
RT "A set of ordered cosmids and a detailed genetic and physical
RT map for the 8 Mb Streptomyces coelicolor A3(2) chromosome.";
RL Mol. Microbiol. 21(1):77-96(1996).
XX
CC Notes:
CC
CC Streptomyces coelicolor sequencing at The Sanger Centre is funded
CC by the BBSRC.
CC
CC Details of S. coelicolor sequencing at the Sanger Centre
CC are available on the World Wide Web.
CC (URL; http://www.sanger.ac.uk/Projects/S_coelicolor/)
CC
CC CDS are numbered using the following system eg SC7B7.01c.
CC SC (S. coelicolor), 7B7 (cosmid name), .01 (first CDS),
CC c (complementary strand).
CC
CC The more significant matches with motifs in the PROSITE
CC database are also included but some of these may be fortuitous.
CC
CC The length in codons is given for each CDS.
CC
CC Usually the highest scoring match found by fasta -o is given for
CC CDS which show significant similarity to other CDS in the database.
CC The position of possible ribosome binding site sequences are
CC given where these have been used to deduce the initiation codon.
CC
CC Gene prediction is based on positional base preference in codons
CC using a specially developed Hidden Markov Model (Krogh et al.,
CC Nucleic Acids Research, 22(22):4768-4778(1994)) and the FramePlot
CC program of Bibb et al., Gene 30:157-66(1984) as implemented at
CC http://www.nih.go.jp/~jun/cgi-bin/frameplot.pl. CAUTION: We may
CC not have predicted the correct initiation codon. Where possible
CC we choose an initiation codon (atg, gtg, ttg or (att)) which is
CC preceded by an upstream ribosome binding site sequence (optimally
CC 5-13bp before the initiation codon). If this cannot be identified
CC we choose the most upstream initiation codon.
CC
CC IMPORTANT: This sequence MAY NOT be the entire insert of
CC the sequenced clone. It may be shorter because we only
CC sequence overlapping sections once, or longer, because we
CC arrange for a small overlap between neighbouring submissions.
CC
CC Cosmid 10H5 lies to the right of 3A7 on the AseI-B genomic restriction
CC fragment.
XX
FH Key Location/Qualifiers
FH
FT source 1..4870
FT /organism="Streptomyces coelicolor"
FT /strain="A3(2)"
FT /clone="cosmid 10H5"
FT CDS complement(<1..327)
FT /note="SC10H5.01c, unknown, partial CDS, len >109 aa;
FT possible integral membrane protein"
FT /gene="SC10H5.01c"
FT /product="hypothetical protein SC10H5.01c"
FT CDS complement(350..805)
FT /note="SC10H5.02c, probable integral membrane protein, len:
FT 151 aa; similar to S. coelicolor hypothetical protein
FT TR:O54194 (EMBL:AL021411) SC7H1.35 (155 aa), fasta scores;
FT opt: 431 z-score: 749.8 E(): 0, 53.5% identity in 114 aa
FT overlap."
FT /product="putative integral membrane protein"
FT /gene="SC10H5.02c"
FT RBS complement(812..815)
FT /note="possible RBS upstream of SC10H5.02c"
FT CDS complement(837..1301)
FT /note="SC10H5.03c, probable integral membrane protein, len:
FT 154 aa"
FT /product="putative integral membrane protein"
FT /gene="SC10H5.03c"
FT RBS complement(1308..1312)
FT /note="possible RBS upstream of SC10H5.03c"
FT CDS complement(1427..1735)
FT /note="SC10H5.04c, unknown, len: 103 aa; possible membrane"
FT /gene="SC10H5.04c"
FT /product="hypothetical protein SC10H5.04c"
FT RBS complement(1738..1741)
FT /note="possible RBS upstream of SC10H5.05c"
FT misc_feature 1800^1801
FT /note="Zero-length feature added to test Bioperl parsing"
FT CDS 1933..2022
FT /note="SC10H5.05, questionable ORF, len: 29 aa"
FT /gene="SC10H5.05"
FT /product="hypothetical protein SC10H5.05"
FT CDS 2019..2642
FT /note="SC10H5.06, probable membrane protein, len: 207 aa;
FT similar to S. coelicolor TR:O54192 SC7H1.33c (191 aa),
FT fasta scores; opt: 312 z-score: 355.2 E(): 1.6e-12, 36.8%
FT identity in 182 aa overlap"
FT /product="putative membrane protein"
FT /gene="SC10H5.06"
FT RBS 2627..2631
FT /note="possible RBS upstream of SC10H5.07"
FT CDS 2639..4048
FT /note="SC10H5.07, unknown, len: 469 aa"
FT /gene="SC10H5.07"
FT /product="hypothetical protein SC10H5.07"
FT CDS complement(4100..4297)
FT /note="SC10H5.08c, unknown, len: 65 aa"
FT /gene="SC10H5.08c"
FT /product="hypothetical protein SC10H5.08c"
FT RBS complement(4314..4319)
FT /note="possible RBS upstream of SC10H5.08c"
FT CDS complement(4439..>4870)
FT /note="SC10H5.09c, probable integral membrane protein,
FT partial CDS len: >143 aa; some similarity in C-terminus to
FT S. coelicolor hypothetical protein TR:O54106
FT (EMBL:AL021529) SC10A5.15 (114 aa), fasta scores; opt: 145
FT z-score: 233.8 E(): 9.2e-06, 33.3% identity in 81 aa
FT overlap. Overlaps and extends SC3A7.01c"
FT /product="putative integral membrane protein"
FT /gene="SC10H5.09c"
FT misc_feature 4769..4870
FT /note="overlap with cosmid 3A7 from 1 to 102"
XX
SQ Sequence 4870 BP; 769 A; 1717 C; 1693 G; 691 T; 0 other;
gatcagtaga cccagcgaca gcagggcggg gcccagcagg ccggccgtgg cgtagagcgc 60
gaggacggcg accggcgtgg ccaccgacag gatggctgcg gcgacgcgga cgacaccgga 120
gtgtgccagg gcccaccaca cgccgatggc cgcgagcgcg agtcccgcgc tgccgaacag 180
ggcccacagc acactgcgca gaccggcggc cacgagtggc gccaggacgg tgcccagcag 240
gagcagcagg gtgacgtggg cgcgcgctgc actgtggccg ccccgtccgc ccgacgcgcg 300
cggctcgtca tctcgcggtc ccaccaccgg tcggccccat tactcgtcct caaccctgtg 360
gcgactgacg ttccccggac aggtcgtacc gattgccgcc acgccccacc acgcacaggg 420
cccagacgac gaagcctgac atggtgatca tgacgacgga ccacaccggg tagtacggca 480
gcgagaggaa gttggcgatg atcaccagcc cggcgatggc gaccccggtg acacgtgccc 540
acatcgccgt tttgagcagc ccggcgctga cgaccatggc gagcgcgccg agcgcgagat 600
ggatccaccc ccacccggtg agatcgaact ggaaaacgta gttgggcgtg gtgacgaaga 660
cgtcgtcctc ggcgatggcc atgatgcccc ggaagaggct gagcagcccg gcgaggaaga 720
gcatcaccgc cgcgaaggcg gtaaggcccg tcgcccattc ctgcctcgcg gtgtgtgccg 780
ggtggtgggt atgtgacgtg gtcatctcgg acctcgtttc gtggaatgcg gatgcttcag 840
cgagcggagg cgccggtgcc cgccgcgccc gtgtgccctg ccgggccgtg accggacagg 900
accaattcct tcgccttgcg gaactcctcg tccgtgatgg caccccggtc tcggatctcg 960
gagagccggg ccagctcgtc gacgctgctg gacccgccgc ccacggtctt cctgatgtag 1020
gcgtcgaact cctcctgctg agcccgtgcc cgcgttgtct cccggctgcc catgttcttg 1080
ccgcgagcga tcacgtagac gaaaacgccc aggaagggca ggaggatgca gaacaccaac 1140
cagccggcct tcgcccagcc actcagtccg tcgtcccgga agatgtcggt gacgacgcgg 1200
aagagcagga cgaaccacat gatccacagg aagatcatca gcatcgtcca gaaggcaccc 1260
agcagtgggt agtcgtacgc caggtaggtc tgtgcactca tgtccgtcct ccgtcctccg 1320
gggcgcggcc cggcggccct cgttccgtac tgacatcagg gtggtcacgg gtcccaccgg 1380
tcggcatcac ccggcacggg tgagtggggc gccgaggccg tcgtggtcag gcccgggaca 1440
ccggtgtgac cctggtggaa ggacgcgtcc cgtggggcac gcaccgccgg ccgagggcga 1500
ccaccgcctc ggtcagtccg agcaggccca gccacaggcc gagaagtcgg gtcagggcac 1560
gggccgactc ggcgggcagc gcgaggacga cgattccggc gacgtcgacg gccagcgggt 1620
tgcgcaggcc cagcactccg gccggggcgc ccggcaccag cgtggcgagg gccgatgcca 1680
tgagccaggt ccaggaaccc ccaagcctgg cgaggacgtg cgccggatcg ctcaatgctc 1740
cggtgaccgc cccgcccgac ccgtctccct tgtcggcagg ttccgccgca tcacgcggaa 1800
cggagatggc tcccctgtgg atcgggcggc cgctgcgggg ccgcccggtt ggtcggtcgg 1860
tgagcgccgg actccccctt cagctcttcc agggtcgggg tcgacaccga ggtcctggat 1920
cacccgtcag gggtgatccg ggcatgccgt cgtggcggtg aggtgggata cgggaacgat 1980
cggcccacgg gggaccggac gagacgaaga gacgtgagat gagcgatacg aactcgggcg 2040
gcgggcgcca ggccgcttcc ggaccggccc cacgtggccg actccctttc cgccggcgcg 2100
tggccctggt cgctgtcgca cgtcccctga tcgtcacggt cggtctcgtc accgcctact 2160
acctgcttcc cctggacgag agactcagcg ccggcaccct ggtgtcgctg gtgtgcggac 2220
tgctcgcagt ccttctggtg ttctgctggg aggtgcgggc catcacgcgc tccccgcatc 2280
cgcgtctgag agcgatcgag ggcctggccg ccacgctggt gctgttcctg gtcctcttcg 2340
ccggctccta ctacctgctg ggtcgctccg cgcccggctc cttcagcgag ccgctgaaca 2400
ggacggacgc gctgtacttc actctgacca cgttcgccac cgtcggcttc ggggacatca 2460
ccgcacgctc cgagaccggg cggatcctca cgatggcgca gatgacggga gggctactgc 2520
tcgtcggagt cgccgcccgg gtgctggcga gcgcagtgca ggcggggctg caccgacagg 2580
gccggggacc ggcggcatcg ccacgctccg gtgctgcgga ggagccggag gccggaccat 2640
gaccgtaccc ggtggcttca ccgcctccct gccgccggcc gagcgagccg cgtacggcag 2700
gaaggcccgt aaaagggcct cacgttcgtg ccacggctgg tacgagccgg ggcagcggcg 2760
gcctgacccc gtcgacctgc tggagcgcca gtccggcgag cgtgtcccgg cactcgtgcc 2820
catccgctac ggtcgcatgc tggagtcgcc gttccgcttc taccgcggtg cggcagcgat 2880
catggcggcg gacctggcac ccctgcccag cagcggactc caggtgcaat tgtgcgggga 2940
cgcgcacccg ttgaacttcc ggctcctggc ctcaccggag cgccggctgg tcttcgacat 3000
caacgacttc gacgagacgc tgcccggccc cttcgagtgg gacgtcaaac ggctggcggc 3060
cggattcgtg atcgcggccc ggtcgaacgg cttctcgtcc aaggaacaga accgcaccgt 3120
tcgggcctgt gtgcgggcct accgggagcg catgagggag ttcgccgtca tgccgaccct 3180
ggacatctgg tacgcccagg acgacgccga ccacgtacgg caactgctgg ctacggaggc 3240
cagaggagaa gctgagcagc ggctcaggga cgcggctgcg aaggcccgca cacgcaccca 3300
catgagggcg ttcgcgaagc tcacccgcgt cacggccgag ggccggcgca tcacccccga 3360
cccgccgctg atcaccccac tcggcgatct gctcaccgac ccggccgaag ccggccggga 3420
ggaggaactg cggtccgtcg tgaacggcta cgcacggtcc ctgccgcccg agcgccggca 3480
cctgctgcgt cactaccggc ttgtggacat ggcgcgcaag gtggtcggcg tcggcagtgt 3540
cggcacccgc tgctgggtac tgcttctgct cggcagggac gacgacgatc ctctgctgct 3600
ccaggccaag gaagcctcgg aatcggtgct ggcggcccac acgggcggcg aacgctacga 3660
ccatcagggc cgcagggtcg tggccggcca gcgtctgatc cagaccaccg gtgacatctt 3720
tctcggctgg gcgcgcgtca ccggcttcga cggaaaggcc cgggacttct acgtgcgtca 3780
actgtgggac tggaagggcg tcgcgcggcc ggaaaccatg gggcccgacc tgctctccct 3840
cttcgcccgg ctgtgcggtg cctgcctggc gagggcccac gcccgttccg gtgaccccgt 3900
cgcgctcgcc gcgtacctgg gcggcagcga ccgcttcgac ggcgcgctca ccgagttcgc 3960
ccagtcctac gccgatcaga atgaacgcga ccacgaagct ctgctggcgg cctgccgctc 4020
cggcagggtc acggccgccc gtttgtgagg ccgacccggg aacggccggc gggctggcac 4080
acaccgccgc cggtcggcgt cattccggaa gctgccgcat ctccaggacg cgcaggccca 4140
gcgactggca gcgggtgagc aacccgtaca gatgggcctc gtcgatcacc gtgccgaaca 4200
gcacggtctg gccggacatg acgacgtgct ccagctccgg gaacgcgttg gccagcgtcc 4260
gtgacaggtg tccctcgacg cggatctcgt agcgcacgag cggtcctttc accgtaggag 4320
ctcgggacac cgcccggggc tccgggtcgg acggtgctct tggtgacgag cctgcgcctc 4380
gtcgccctcc ggtgccctca cccagcacag gtgactccaa ccgcagtgtc agtgcctttc 4440
agtgcgtcac tgtgatcttg acgacgacga tcaccaggcc gagcagtacg ttgaccgtcg 4500
cggtgacggc caccagtcgt cgcgaggcgc ccgcgcggtg cgccgcggcg acggaccagc 4560
ccacctgacc ggcgacggcg acggacagcg ccagccacag ggtgcccggg acgtccagcc 4620
ccagtacggg gctgacggcg atggccgcgg ccggaggcac ggcggccttg acgatcggcc 4680
actcctcgcg gcacacacgc agaatcaccc gccggtccgg agtgtgccgc gcgagacgcg 4740
ctccgaacag ttcggcgtgg acgtgagcga tccagaacac caagctggtg agcaacagca 4800
gaagaaccag ttcggcgcgg gggaacgagc ccagggtgcc ggcgccgatc acgacggagg 4860
ctgcgagcat 4870
//
Genbank file format:
====================
LOCUS SCU49845 5028 bp DNA PLN 21-JUN-1999
DEFINITION Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
(AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION U49845
VERSION U49845.1 GI:1293613
KEYWORDS .
SOURCE Saccharomyces cerevisiae (baker's yeast)
ORGANISM Saccharomyces cerevisiae
Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE 1 (bases 1 to 5028)
AUTHORS Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
TITLE Cloning and sequence of REV7, a gene whose function is required for
DNA damage-induced mutagenesis in Saccharomyces cerevisiae
JOURNAL Yeast 10 (11), 1503-1509 (1994)
PUBMED 7871890
REFERENCE 2 (bases 1 to 5028)
AUTHORS Roemer,T., Madden,K., Chang,J. and Snyder,M.
TITLE Selection of axial growth sites in yeast requires Axl2p, a novel
plasma membrane glycoprotein
JOURNAL Genes Dev. 10 (7), 777-793 (1996)
PUBMED 8846915
REFERENCE 3 (bases 1 to 5028)
AUTHORS Roemer,T.
TITLE Direct Submission
JOURNAL Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
Haven, CT, USA
FEATURES Location/Qualifiers
source 1..5028
/organism="Saccharomyces cerevisiae"
/db_xref="taxon:4932"
/chromosome="IX"
/map="9"
CDS <1..206
/codon_start=3
/product="TCP1-beta"
/protein_id="AAA98665.1"
/db_xref="GI:1293614"
/translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
AEVLLRVDNIIRARPRTANRQHM"
gene 687..3158
/gene="AXL2"
CDS 687..3158
/gene="AXL2"
/note="plasma membrane glycoprotein"
/codon_start=1
/function="required for axial budding pattern of S.
cerevisiae"
/product="Axl2p"
/protein_id="AAA98666.1"
/db_xref="GI:1293615"
/translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
VDFSNKSNVNVGQVKDIHGRIPEML"
gene complement(3300..4037)
/gene="REV7"
CDS complement(3300..4037)
/gene="REV7"
/codon_start=1
/product="Rev7p"
/protein_id="AAA98667.1"
/db_xref="GI:1293616"
/translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
LISGDDKILNGVYSQYEEGESIFGSLF"
ORIGIN
1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga
481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc
541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga
601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta
661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag
721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa
781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata
841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga
901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac
961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg
1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc
1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa
1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca
1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac
1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa
1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag
1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct
1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac
1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa
1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc
1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata
1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca
1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc
1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc
1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca
1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc
1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg
2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt
2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc
2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg
2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca
2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata
2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg
2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga
2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt
2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat
2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt
2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc
2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag
2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta
2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa
2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact
2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt
3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa
3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag
3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct
3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt
3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact
3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa
3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg
3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt
3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc
3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca
3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc
3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc
3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat
3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa
3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga
3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat
3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc
4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc
4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa
4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg
4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc
4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt
4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg
4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg
4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt
4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt
4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat
4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc
4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct
4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta
4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
//
Legal values for the division code include:
PRI - primate sequences
ROD - rodent sequences
MAM - other mammalian sequences
VRT - other vertebrate sequences
INV - invertebrate sequences
PLN - plant, fungal, and algal sequences
BCT - bacterial sequences
VRL - viral sequences
PHG - bacteriophage sequences
SYN - synthetic sequences
UNA - unannotated sequences
EST - EST sequences (Expressed Sequence Tags)
PAT - patent sequences
STS - STS sequences (Sequence Tagged Sites)
GSS - GSS sequences (Genome Survey Sequences)
HTG - HTGS sequences (High Throughput Genomic sequences)
HTC - HTC sequences (High Throughput cDNA sequences)
ENV - Environmental sampling sequences
CON - Constructed sequences
TSA - Transcriptome Shotgun Assembly sequences
more info on: ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
http://www.insdc.org/documents/feature-table
Feature key examples
====================
Key Description
CDS Protein-coding sequence
RBS ribosome binding site
rep_origin Origin of replication
protein_bind Protein binding site on DNA
tRNA mature transfer RNA
Qualifier examples:
===================
Key Location/Qualifiers
source 1..1509
/organism="Mus musculus"
/strain="CD1"
/mol_type="genomic DNA"
promoter <1..9
/gene="ubc42"
mRNA join(10..567,789..1320)
/gene="ubc42"
CDS join(54..567,789..1254)
/gene="ubc42"
/product="ubiquitin conjugating enzyme"
/function="cell division control"
Location examples:
==================
The following is a list of common location descriptors with their meanings:
Location Description
467 Points to a single base in the presented sequence
340..565 Points to a continuous range of bases bounded by and
including the starting and ending bases
<345..500 Indicates that the exact lower boundary point of a feature
is unknown. The location begins at some base previous to
the first base specified (which need not be contained in
the presented sequence) and continues to and includes the
ending base
<1..888 The feature starts before the first sequenced base and
continues to and includes base 888
1..>888 The feature starts at the first sequenced base and
continues beyond base 888
102.110 Indicates that the exact location is unknown but that it is
one of the bases between bases 102 and 110, inclusive
123^124 Points to a site between bases 123 and 124
join(12..78,134..202) Regions 12 to 78 and 134 to 202 should be joined to form
one contiguous sequence
complement(34..126) Start at the base complementary to 126 and finish at the
base complementary to base 34 (the feature is on the strand
complementary to the presented strand)
complement(join(2691..4571,4918..5163))
Joins regions 2691 to 4571 and 4918 to 5163, then
complements the joined segments (the feature is on the
strand complementary to the presented strand)
join(complement(4918..5163),complement(2691..4571))
Complements regions 4918 to 5163 and 2691 to 4571, then
joins the complemented segments (the feature is on the
strand complementary to the presented strand)
J00194.1:100..202 Points to bases 100 to 202, inclusive, in the entry (in
this database) with primary accession number 'J00194'
join(1..100,J00194.1:100..202)
Joins region 1..100 of the existing entry with the region
100..202 of remote entry J00194
Listing
Symbol Meaning
------ -------
a a; adenine
c c; cytosine
g g; guanine
t t; thymine in DNA; uracil in RNA
m a or c
r a or g
w a or t
s c or g
y c or t
k g or t
v a or c or g; not t
h a or c or t; not g
d a or g or t; not c
b c or g or t; not a
n a or c or g or t
Modified base abbreviations:
============================
Authority Sprinzl, M. and Gauss, D.H.
Reference Sprinzl, M. and Gauss, D.H. Nucl Acid Res 10, r1 (1982).
(note that in Cornish_Bowden, A. Nucl Acid Res 13, 3021-3030
(1985) the IUPAC-IUB declined to recommend a set of
abbreviations for modified nucleotides)
Contact NCBI
Scope /mod_base
Abbreviation Modified base description
------------ -------------------------
ac4c 4-acetylcytidine
chm5u 5-(carboxyhydroxylmethyl)uridine
cm 2'-O-methylcytidine
cmnm5s2u 5-carboxymethylaminomethyl-2-thiouridine
cmnm5u 5-carboxymethylaminomethyluridine
d dihydrouridine
fm 2'-O-methylpseudouridine
gal q beta,D-galactosylqueosine
gm 2'-O-methylguanosine
i inosine
i6a N6-isopentenyladenosine
m1a 1-methyladenosine
m1f 1-methylpseudouridine
m1g 1-methylguanosine
m1i 1-methylinosine
m22g 2,2-dimethylguanosine
m2a 2-methyladenosine
m2g 2-methylguanosine
m3c 3-methylcytidine
m5c 5-methylcytidine
m6a N6-methyladenosine
m7g 7-methylguanosine
mam5u 5-methylaminomethyluridine
mam5s2u 5-methoxyaminomethyl-2-thiouridine
man q beta,D-mannosylqueosine
mcm5s2u 5-methoxycarbonylmethyl-2-thiouridine
mcm5u 5-methoxycarbonylmethyluridine
mo5u 5-methoxyuridine
ms2i6a 2-methylthio-N6-isopentenyladenosine
ms2t6a N-((9-beta-D-ribofuranosyl-2-methyltiopurine-6-yl)car
bamoyl)threonine
mt6a N-((9-beta-D-ribofuranosylpurine-6-yl)N-methyl-carbam
oyl)threonine
mv uridine-5-oxyacetic acid-methylester
o5u uridine-5-oxyacetic acid (v)
osyw wybutoxosine
p pseudouridine
q queosine
s2c 2-thiocytidine
s2t 5-methyl-2-thiouridine
s2u 2-thiouridine
s4u 4-thiouridine
m5u 5-methyluridine
t6a N-((9-beta-D-ribofuranosylpurine-6-yl)carbamoyl)threo
nine
tm 2'-O-methyl-5-methyluridine
um 2'-O-methyluridine
yw wybutosine
x 3-(3-amino-3-carboxypropyl)uridine, (acp3)u
OTHER (requires /note= qualifier)
Amino acid abbreviations:
=========================
Authority IUPAC-IUB Joint Commission on Biochemical Nomenclature.
Reference IUPAC-IUB Joint Commission on Biochemical Nomenclature.
Nomenclature and Symbolism for Amino Acids and
Peptides.
Eur. J. Biochem. 138:9-37(1984).
IUPAC-IUBMB JCBN Newsletter, 1999
http://www.chem.qmul.ac.uk/iubmb/newsletter/1999/item3.html
Scope /anticodon, /transl_except
Contact EMBL-EBI
Listing (note that the abbreviations are legal values for amino acids, not the full names)
Abbreviation Amino acid name
------------ ---------------
Ala A Alanine
Arg R Arginine
Asn N Asparagine
Asp D Aspartic acid (Aspartate)
Cys C Cysteine
Gln Q Glutamine
Glu E Glutamic acid (Glutamate)
Gly G Glycine
His H Histidine
Ile I Isoleucine
Leu L Leucine
Lys K Lysine
Met M Methionine
Phe F Phenylalanine
Pro P Proline
Pyl O Pyrrolysine
Ser S Serine
Sec U Selenocysteine
Thr T Threonine
Trp W Tryptophan
Tyr Y Tyrosine
Val V Valine
Asx B Aspartic acid or Asparagine
Glx Z Glutamine or Glutamic acid.
Xaa X Any amino acid.
Xle J Leucine or Isoleucine
TERM termination codon
Modified and unusual Amino Acids:
=================================
Abbreviation Amino acid
------------ ---------
Aad 2-Aminoadipic acid
bAad 3-Aminoadipic acid
bAla beta-Alanine, beta-Aminoproprionic acid
Abu 2-Aminobutyric acid
4Abu 4-Aminobutyric acid, piperidinic acid
Acp 6-Aminocaproic acid
Ahe 2-Aminoheptanoic acid
Aib 2-Aminoisobutyric acid
bAib 3-Aminoisobutyric acid
Apm 2-Aminopimelic acid
Dbu 2,4-Diaminobutyric acid
Des Desmosine
Dpm 2,2'-Diaminopimelic acid
Dpr 2,3-Diaminoproprionic acid
EtGly N-Ethylglycine
EtAsn N-Ethylasparagine
Hyl Hydroxylysine
aHyl allo-Hydroxylysine
3Hyp 3-Hydroxyproline
4Hyp 4-Hydroxyproline
Ide Isodesmosine
aIle allo-Isoleucine
MeGly N-Methylglycine, sarcosine
MeIle N-Methylisoleucine
MeLys 6-N-Methyllysine
MeVal N-Methylvaline
Nva Norvaline
Nle Norleucine
Orn Ornithine
OTHER (requires /note=)
Records from the RefSeq database of reference sequences have a different accession number format that begins with two letters followed by an underscore bar and six or more digits, for example:
NT_123456 constructed genomic contigs
NM_123456 mRNAs
NP_123456 proteins
NC_123456 chromosomes
Feature key:
============
assembly_gap:
gap between two components of a genome or transcriptome assembly;
attenuator:
1) region of DNA at which regulation of termination of transcription occurs, which controls the expression of some bacterial operons;
2) sequence segment located between the promoter and the first structural gene that causes partial termination of transcription
C_region:
constant region of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains; includes one or more exons depending on the particular chain
CAAT_signal:
CAAT box; part of a conserved sequence located about 75bp up-stream of the start point of eukaryotic transcription units which may be involved in RNA polymerase binding; consensus=GG(C or T)CAATCT
CDS:
coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation.
centromere:
region of biological interest identified as a centromere and which has been experimentally characterized;
D-loop:
displacement loop; a region within mitochondrial DNA in which a short stretch of RNA is paired with one strand of DNA, displacing the original partner DNA strand in this region; also used to describe the displacement of a region of one strand of duplex DNA by a single stranded invader in the reaction catalyzed by RecA protein
D_segment:
Diversity segment of immunoglobulin heavy chain, and T-cell receptor beta chain;
enhancer:
a cis-acting sequence that increases the utilization of (some) eukaryotic promoters, and can function in either orientation and in any location (upstream or downstream) relative to the promoter;
exon:
region of genome that codes for portion of spliced mRNA,rRNA and tRNA; may contain 5'UTR, all CDSs and 3' UTR;
gap:
gap in the sequence
GC_signal:
GC box; a conserved GC-rich region located upstream of the start point of eukaryotic transcription units which may occur in multiple copies or in either orientation;consensus=GGGCGG;
gene:
region of biological interest identified as a gene and for which a name has been assigned;
iDNA:
intervening DNA; DNA which is eliminated through any of several kinds of recombination;
intron:
a segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences(exons) on either side of it;
J_segment:
joining segment of immunoglobulin light and heavy chains, and T-cell receptor alpha, beta, and gamma chains;
LTR:
long terminal repeat, a sequence directly repeated at both ends of a defined sequence, of the sort typically found in retroviruses;
mat_peptide:
mature peptide or protein coding sequence; coding sequence for the mature or final peptide or protein product following post-translational modification; the location does not include the stop codon (unlike the corresponding CDS);
misc_binding:
site in nucleic acid which covalently or non-covalently binds another moiety that cannot be described by any other binding key (primer_bind or protein_bind);
misc_difference:
feature sequence is different from that presented in the entry and cannot be described by any other Difference key (unsure, old_sequence, variation, or modified_base);
misc_feature:
region of biological interest which cannot be described by any other feature key; a new or rare feature;
misc_recomb:
site of any generalized, site-specific or replicative recombination event where there is a breakage and reunion of duplex DNA that cannot be described by other recombination keys or qualifiers of source key (/proviral);
misc_RNA:
any transcript or RNA product that cannot be defined by other RNA keys (prim_transcript, precursor_RNA, mRNA, 5'UTR, 3'UTR, exon, CDS, sig_peptide, transit_peptide,mat_peptide, intron, polyA_site, ncRNA, rRNA and tRNA);
misc_signal:
any region containing a signal controlling or altering gene function or expression that cannot be described by other signal keys (promoter, CAAT_signal, TATA_signal,-35_signal, -10_signal, GC_signal, RBS, polyA_signal, enhancer, attenuator, terminator, and rep_origin).
misc_structure:
any secondary or tertiary nucleotide structure or conformation that cannot be described by other Structure keys (stem_loop and D-loop);
mobile_element:
region of genome containing mobile elements;
modified_base:
the indicated nucleotide is a modified nucleotide and should be substituted for by the indicated molecule (given in the mod_base qualifier value)
mRNA:
messenger RNA; includes 5'untranslated region (5'UTR), coding sequences (CDS, exon) and 3'untranslated region (3'UTR);
ncRNA:
a non-protein-coding gene, other than ribosomal RNA and transfer RNA, the functional molecule of which is the RNA transcript;
N_region:
extra nucleotides inserted between rearranged immunoglobulin segments.
old_sequence:
the presented sequence revises a previous version of the sequence at this location;
operon:
region containing polycistronic transcript including a cluster of genes that are under the control of the same regulatory sequences/promotor and in the same biological pathway
oriT:
origin of transfer; region of a DNA molecule where transfer is initiated during the process of conjugation or mobilization
polyA_signal:
recognition region necessary for endonuclease cleavage of an RNA transcript that is followed by polyadenylation; consensus=AATAAA;
polyA_site:
site on an RNA transcript to which will be added adenine residues by post-transcriptional polyadenylation;
precursor_RNA:
any RNA species that is not yet the mature RNA product; may include 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron) and 3' untranslated region (3'UTR);
prim_transcript:
primary (initial, unprocessed) transcript; includes 5' untranslated region (5'UTR), coding sequences (CDS, exon), intervening sequences (intron) and 3' untranslated region (3'UTR);
primer_bind:
non-covalent primer binding site for initiation of replication, transcription, or reverse transcription; includes site(s) for synthetic e.g., PCR primer elements;
promoter:
region on a DNA molecule involved in RNA polymerase binding to initiate transcription;
protein_bind:
non-covalent protein binding site on nucleic acid;
RBS:
ribosome binding site;
How to check if biopython is installed:
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from __future__ import print_function
>>> import Bio
>>> print(Bio.__version__)
1.62
Bio.Seq Seq class:
==================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> my_seq=Seq("AGTCTAGTCT")
>>> my_seq
Seq('AGTCTAGTCT', Alphabet())
>>> print(my_seq)
AGTCTAGTCT
>>> my_seq.alphabet
Alphabet()
>>> my_seq
Seq('AGTCTAGTCT', Alphabet())
>>> my_seq.complement()
Seq('TCAGATCAGA', Alphabet())
>>> my_seq.reverse_complement()
Seq('AGACTAGACT', Alphabet())
>>>
SeqRecord holds a sequence as Seq object with additional annotation, including and identifier, name, and description
Bio.SeqIO is for reading and writing sequence file formats and works with SeqRecord object
download files from:
http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide
[uzi@quince-srv2 ~/biopython]$ grep -i ">" ls_orchid.fasta
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765654|emb|Z78529.1|CLZ78529 C.lichiangense 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765652|emb|Z78527.1|CYZ78527 C.yatabeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765651|emb|Z78526.1|CGZ78526 C.guttatum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765650|emb|Z78525.1|CAZ78525 C.acaule 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765649|emb|Z78524.1|CFZ78524 C.formosanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765648|emb|Z78523.1|CHZ78523 C.himalaicum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765647|emb|Z78522.1|CMZ78522 C.macranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765646|emb|Z78521.1|CCZ78521 C.calceolus 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765645|emb|Z78520.1|CSZ78520 C.segawai 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765644|emb|Z78519.1|CPZ78519 C.pubescens 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765643|emb|Z78518.1|CRZ78518 C.reginae 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765642|emb|Z78517.1|CFZ78517 C.flavum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765641|emb|Z78516.1|CPZ78516 C.passerinum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765640|emb|Z78515.1|MXZ78515 M.xerophyticum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765639|emb|Z78514.1|PSZ78514 P.schlimii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765638|emb|Z78513.1|PBZ78513 P.besseae 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765637|emb|Z78512.1|PWZ78512 P.wallisii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765636|emb|Z78511.1|PEZ78511 P.exstaminodium 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765635|emb|Z78510.1|PCZ78510 P.caricinum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765634|emb|Z78509.1|PPZ78509 P.pearcei 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765633|emb|Z78508.1|PLZ78508 P.longifolium 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765632|emb|Z78507.1|PLZ78507 P.lindenii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765631|emb|Z78506.1|PLZ78506 P.lindleyanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765630|emb|Z78505.1|PSZ78505 P.sargentianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765629|emb|Z78504.1|PKZ78504 P.kaiteurum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765628|emb|Z78503.1|PCZ78503 P.czerwiakowianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765627|emb|Z78502.1|PBZ78502 P.boissierianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765626|emb|Z78501.1|PCZ78501 P.caudatum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765625|emb|Z78500.1|PWZ78500 P.warszewiczianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765624|emb|Z78499.1|PMZ78499 P.micranthum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765623|emb|Z78498.1|PMZ78498 P.malipoense 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765622|emb|Z78497.1|PDZ78497 P.delenatii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765621|emb|Z78496.1|PAZ78496 P.armeniacum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765620|emb|Z78495.1|PEZ78495 P.emersonii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765619|emb|Z78494.1|PNZ78494 P.niveum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765618|emb|Z78493.1|PGZ78493 P.godefroyae 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765617|emb|Z78492.1|PBZ78492 P.bellatulum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765616|emb|Z78491.1|PCZ78491 P.concolor 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765615|emb|Z78490.1|PFZ78490 P.fairrieanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765614|emb|Z78489.1|PDZ78489 P.druryi 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765613|emb|Z78488.1|PTZ78488 P.tigrinum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765612|emb|Z78487.1|PHZ78487 P.hirsutissimum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765611|emb|Z78486.1|PBZ78486 P.barbigerum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765610|emb|Z78485.1|PHZ78485 P.henryanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765609|emb|Z78484.1|PCZ78484 P.charlesworthii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765608|emb|Z78483.1|PVZ78483 P.villosum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765607|emb|Z78482.1|PEZ78482 P.exul 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765606|emb|Z78481.1|PIZ78481 P.insigne 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765605|emb|Z78480.1|PGZ78480 P.gratrixianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765604|emb|Z78479.1|PPZ78479 P.primulinum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765603|emb|Z78478.1|PVZ78478 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765602|emb|Z78477.1|PVZ78477 P.victoria 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765601|emb|Z78476.1|PGZ78476 P.glaucophyllum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765600|emb|Z78475.1|PSZ78475 P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765599|emb|Z78474.1|PKZ78474 P.kolopakingii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765598|emb|Z78473.1|PSZ78473 P.sanderianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765597|emb|Z78472.1|PLZ78472 P.lowii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765596|emb|Z78471.1|PDZ78471 P.dianthum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765595|emb|Z78470.1|PPZ78470 P.parishii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765594|emb|Z78469.1|PHZ78469 P.haynaldianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765593|emb|Z78468.1|PAZ78468 P.adductum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765592|emb|Z78467.1|PSZ78467 P.stonei 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765591|emb|Z78466.1|PPZ78466 P.philippinense 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765590|emb|Z78465.1|PRZ78465 P.rothschildianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765589|emb|Z78464.1|PGZ78464 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765588|emb|Z78463.1|PGZ78463 P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765587|emb|Z78462.1|PSZ78462 P.sukhakulii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765586|emb|Z78461.1|PWZ78461 P.wardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765585|emb|Z78460.1|PCZ78460 P.ciliolare 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765584|emb|Z78459.1|PDZ78459 P.dayanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765583|emb|Z78458.1|PHZ78458 P.hennisianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765582|emb|Z78457.1|PCZ78457 P.callosum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765581|emb|Z78456.1|PTZ78456 P.tonsum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765580|emb|Z78455.1|PJZ78455 P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765579|emb|Z78454.1|PFZ78454 P.fowliei 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765578|emb|Z78453.1|PSZ78453 P.schoseri 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765577|emb|Z78452.1|PBZ78452 P.bougainvilleanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765576|emb|Z78451.1|PHZ78451 P.hookerae 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765575|emb|Z78450.1|PPZ78450 P.papuanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765574|emb|Z78449.1|PMZ78449 P.mastersianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765573|emb|Z78448.1|PAZ78448 P.argus 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765572|emb|Z78447.1|PVZ78447 P.venustum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765571|emb|Z78446.1|PAZ78446 P.acmodontum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765570|emb|Z78445.1|PUZ78445 P.urbanianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765569|emb|Z78444.1|PAZ78444 P.appletonianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765568|emb|Z78443.1|PLZ78443 P.lawrenceanum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765567|emb|Z78442.1|PBZ78442 P.bullenianum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765566|emb|Z78441.1|PSZ78441 P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765565|emb|Z78440.1|PPZ78440 P.purpuratum 5.8S rRNA gene and ITS1 and ITS2 DNA
>gi|2765564|emb|Z78439.1|PBZ78439 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA
[uzi@quince-srv2 ~/biopython]$ grep -c ">" ls_orchid.fasta
94
Read the FASTA file with Biopython:
===================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse("ls_orchid.fasta","fasta"):
... print(seq_record.id)
... print(repr(seq_record.seq))
... print(len(seq_record))
...
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
gi|2765657|emb|Z78532.1|CCZ78532
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', SingleLetterAlphabet())
753
gi|2765656|emb|Z78531.1|CFZ78531
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', SingleLetterAlphabet())
748
Here, biopython defaulted the alphabet to generic SingleLetterAlphabet()
repr function:
==============
>>> x = 'foo'
>>> x
'foo'
So the name x is attached to 'foo' string. When you call for example repr(x) the iterpreter put 'foo' instead of x and then calls repr('foo').
>>> repr(x)
"'foo'"
>>> x.__repr__()
"'foo'"
repr actually calls a magic method __repr__ of x, which gives a string containing the representation of the value 'foo' assigned to x. So it returns 'foo' inside the string "" resulting in "'foo'". The idea of repr is to give a string which contains a series of symbols which we can type in the interpreter and get the same value which was sent as an argument to repr.
enumerate function:
===================
[(i, j) for i, j in enumerate(mylist)]
You need to put i,j inside a tuple for the list comprehension to work. Alternatively, given that enumerate() already returns a tuple, you can return it directly without unpacking it first:
[pair for pair in enumerate(mylist)]
Either way, the result that gets returned is as expected:
> [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
Read genbank file with biopython:
=================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for seq_record in SeqIO.parse("ls_orchid.gbk","genbank"):
... print(seq_record.id)
... print(repr(seq_record.seq))
... print(len(seq_record))
...
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
Z78532.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAG...GGC', IUPACAmbiguousDNA())
753
Z78531.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGCAG...TAA', IUPACAmbiguousDNA())
748
This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA.
IUPAC in detail:
================
Bio.Alphabet.IUPAC --> provides definition for proteins, DNA, and RNA and also provides the ability to customize and extend basic functions
IUPACProtein
ExtendedIUPACProtein (additional elements “U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx” for glutamine or glutamic acid), “J” (or “Xle” for leucine isoleucine) and “X” (or “Xxx” for an unknown amino acid))
IUPACUnambiguousDNA
IUPACAmbiguousDNA
ExtendedIUPACDNA
IUPACAmbiguousRNA
UPACUnambiguousRNA
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq=Seq("AGTAGATGTA", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTAGATGTA', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()
>>>
for protein, in the above one use IUPAC.protein
Accessing Seq members:
======================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq=Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
... print("%i %s" % (index, letter))
...
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
>>> print(my_seq[0])
G
>>> print(my_seq[2])
T
>>> print(my_seq[-1])
G
>>>
Seq is similar to python strings:
=================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2
>>>
Calculate GC Contents:
======================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq=Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
>>>
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875
>>>
Do slices with a start, stop and stride (the step size, which defaults to one):
===============================================================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
>>> print(my_seq)
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> fasta_format_string=">Name\n%s\n" % my_seq
>>> print(fasta_format_string)
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
>>> my_seq.tostring()
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
>>>
Concatenating or adding sequences:
==================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py", line 248, in __add__
% (repr(self.alphabet), repr(other.alphabet)))
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())
>>>
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())
>>>
Changing case:
==============
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True
>>>
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
>>>
IUPAC alphabets are for upper case sequences only
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
>>>
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py", line 718, in complement
raise ValueError("Proteins do not have complements!")
ValueError: Proteins do not have complements!
>>>
TRANSCRIPTION & TRANSLATION:
============================
DNA coding strand (aka Crick strand, strand +1)
5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’
|||||||||||||||||||||||||||||||||||||||
3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’
DNA template strand (aka Watson strand, strand −1)
|
Transcription
↓
5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’
Single stranded messenger RNA
The actual biological transcription process works from the template strand, doing a reverse complement (TCAG → CUGA) to give the mRNA.
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())
>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
... "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
... "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
... "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
... "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
... generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*')
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())
As you can see, all this does is switch T → U, and adjust the alphabet.
In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS
http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1
ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt
http://www.kazusa.or.jp/codon/
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]
>>> print(standard_table)
Table 1 Standard, SGC0
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L(s)| CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I | ACT T | AAT N | AGT S | T
A | ATC I | ACC T | AAC N | AGC S | C
A | ATA I | ACA T | AAA K | AGA R | A
A | ATG M(s)| ACG T | AAG K | AGG R | G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V | GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
>>> print(mito_table)
Table 2 Vertebrate Mitochondrial, SGC1
| T | C | A | G |
--+---------+---------+---------+---------+--
T | TTT F | TCT S | TAT Y | TGT C | T
T | TTC F | TCC S | TAC Y | TGC C | C
T | TTA L | TCA S | TAA Stop| TGA W | A
T | TTG L | TCG S | TAG Stop| TGG W | G
--+---------+---------+---------+---------+--
C | CTT L | CCT P | CAT H | CGT R | T
C | CTC L | CCC P | CAC H | CGC R | C
C | CTA L | CCA P | CAA Q | CGA R | A
C | CTG L | CCG P | CAG Q | CGG R | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T | AAT N | AGT S | T
A | ATC I(s)| ACC T | AAC N | AGC S | C
A | ATA M(s)| ACA T | AAA K | AGA Stop| A
A | ATG M(s)| ACG T | AAG K | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V | GCT A | GAT D | GGT G | T
G | GTC V | GCC A | GAC D | GGC G | C
G | GTA V | GCA A | GAA E | GGA G | A
G | GTG V(s)| GCG A | GAG E | GGG G | G
--+---------+---------+---------+---------+--
>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
>>>
Comparing Seq objects:
======================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq1 == seq2
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/Seq.py:174: FutureWarning: In future comparing Seq objects will use string comparison (not object comparison). Incompatible alphabets will trigger a warning (not an exception). In the interim please use id(seq1)==id(seq2) or str(seq1)==str(seq2) to make your code explicit and to avoid this warning.
"and to avoid this warning.", FutureWarning)
False
>>> id(seq1) == id(seq2)
False
>>> id(seq1) == id(seq1)
True
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq2)
True
>>>
MutableSeq objects:
===================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
Traceback (most recent call last):
File "", line 1, in
TypeError: 'Seq' object does not support item assignment
>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> from Bio.Seq import MutableSeq
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
>>> str(mutable_seq)
'AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG'
>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())
>>>
Difference between mutable and immutable objects in Python means that you can’t use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way.
UnknownSeq object:
==================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet = Alphabet(), character = '?')
>>> print(unk)
????????????????????
>>> len(unk)
20
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> print(unk_dna)
NNNNNNNNNNNNNNNNNNNN
>>> unk_dna.complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet = IUPACAmbiguousDNA(), character = 'N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet = IUPACAmbiguousRNA(), character = 'N')
>>> unk_protein=unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet = ProteinAlphabet(), character = 'X')
>>> print(unk_protein)
XXXXXX
>>> len(unk_protein)
6
>>>
Use Biopython functions on strings:
===================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'
>>>
SeqRecord object:
=================
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
>>> simple_seq_r.id
''
>>> simple_seq_r.id="AC12345"
>>> simple_seq_r.description="made up sequence"
>>> print(simple_seq_r.description)
made up sequence
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
>>> simple_seq_r=SeqRecord(simple_seq,id="AC12345")
>>> simple_seq_r.annotations["evidence"] = "made up"
>>> print(simple_seq_r.annotations)
{'evidence': 'made up'}
>>> print(simple_seq_r.annotations["evidence"])
made up
>>> simple_seq_r.letter_annotations["phred_quality"]=[40,40,38,30]
>>> print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 40, 38, 30]}
>>> print(simple_seq_r.letter_annotations["phred_quality"])
[40, 40, 38, 30]
The SeqRecord class itself is quite simple, and offers the following information as attributes:
.seq
– The sequence itself, typically a Seq object.
.id
– The primary ID used to identify the sequence – a string. In most cases this is something like an accession number.
.name
– A “common” name/id for the sequence – a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
.description
– A human readable description or expressive name for the sequence – a string.
.letter_annotations
– Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores or secondary structure information (e.g. from Stockholm/PFAM alignment files).
.annotations
– A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more “unstructured” information to the sequence.
.features
– A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence).
.dbxrefs
- A list of database cross-references as strings.
SeqRecord example for a FASTA file:
===================================
Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.fna
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record=SeqIO.read("NC_005816.fna","fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())
>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]
>>>
SeqRecord example for a GB file:
================================
Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
http://biopython.org/SRC/biopython/Tests/GenBank/NC_005816.gb
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037'])
>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA())
>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'
>>> record.letter_annotations
{}
>>> len(record.annotations)
11
>>> record.annotations
{'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.', 'sequence_version': 1, 'source': 'Yersinia pestis biovar Microtus str. 91001', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'keywords': [''], 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'accessions': ['NC_005816'], 'data_file_division': 'BCT', 'date': '21-JUL-2008', 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'gi': '45478711'}
>>> record.dbxrefs
['Project:58037']
>>> len(record.features)
41
>>> record.features
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(9609), strand=1), type='source'), SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1954), strand=1), type='repeat_region'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(1109), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(86), ExactPosition(959), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(BeforePosition(110), ExactPosition(209), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(812), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(1105), ExactPosition(1888), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(1108), ExactPosition(1885), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1366), AfterPosition(1669), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(1432), ExactPosition(1456), strand=1), type='misc_feature'), SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(1435), ExactPosition(1459), strand=1), FeatureLocation(ExactPosition(1618), ExactPosition(1621), strand=1)], 'order'), type='misc_feature', location_operator='order'), SeqFeature(FeatureLocation(ExactPosition(1606), ExactPosition(1624), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3119), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3119), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(2924), ExactPosition(3107), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(3485), ExactPosition(3857), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(3485), ExactPosition(3857), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(3497), ExactPosition(3626), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(4342), ExactPosition(4780), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(4342), ExactPosition(4780), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(4814), ExactPosition(5888), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(4814), ExactPosition(5888), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(5909), ExactPosition(5911), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5933), ExactPosition(5933), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5933), ExactPosition(5933), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(5947), ExactPosition(5948), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(6004), ExactPosition(6421), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(6004), ExactPosition(6421), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(6524), ExactPosition(6525), strand=1), type='variation'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7602), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7602), strand=1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(6663), ExactPosition(7599), strand=1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(7788), ExactPosition(8088), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(7788), ExactPosition(8088), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(7836), ExactPosition(7995), strand=-1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(8087), ExactPosition(8360), strand=-1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(8087), ExactPosition(8360), strand=-1), type='CDS'), SeqFeature(FeatureLocation(ExactPosition(8090), AfterPosition(8357), strand=-1), type='misc_feature'), SeqFeature(FeatureLocation(ExactPosition(8529), ExactPosition(8529), strand=1), type='variation')]
>>>
SeqFeature object:
==================
The attributes of a SeqFeature are:
.type
– This is a textual description of the type of feature (for instance, this will be something like ‘CDS’ or ‘gene’).
.location
– The location of the SeqFeature on the sequence that you are dealing with. The SeqFeature delegates much of its functionality to the location object, and includes a number of shortcut attributes for properties of the location:
.ref
– shorthand for .location.ref – any (different) reference sequence the location is referring to. Usually just None.
.ref_db
– shorthand for .location.ref_db – specifies the database any identifier in .ref refers to. Usually just None.
.strand
– shorthand for .location.strand – the strand on the sequence that the feature is located on. For double stranded nucleotide sequence this may either be 1 for the top strand, −1 for the bottom strand, 0 if the strand is important but is unknown, or None if it doesn’t matter. This is None for proteins, or single stranded sequences.
.qualifiers
– This is a Python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be “evidence” and the value might be “computational (non-experimental).” This is just a way to let the person who is looking at the feature know that it has not be experimentally (i. e. in a wet lab) confirmed. Note that other the value will be a list of strings (even when there is only one string). This is a reflection of the feature tables in GenBank/EMBL files.
.sub_features
– This used to be used to represent features with complicated locations like ‘joins’ in GenBank/EMBL files. This has been deprecated with the introduction of the CompoundLocation object, and should now be ignored.
How to make positions and locations:
====================================
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqFeature
>>> start_pos=SeqFeature.AfterPosition(5)
>>> end_pos=SeqFeature.BetweenPosition(9,left=8,right=9)
>>> my_location=SeqFeature.FeatureLocation(start_pos,end_pos)
>>> print(my_location)
[>5:(8^9)]
>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)
>>> int(my_location.start)
5
>>> int(my_location.end)
9
>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9
>>> exact_location=SeqFeature.FeatureLocation(5,9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5
>>>
position
– This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
– ExactPosition, BeforePosition (something like <13), AfterPosition(something like >13), WithinPosition (something like 1.5. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.), OneofPosition, UnknownPosition(This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt)
location
– A location is region of sequence bounded by some positions. For instance 5..20 (i. e. 5 to 20) is a location.
– FeatureLocation/CompoundLocation (for handling ‘join’ locations in EMBL/GenBank files)
Locating SNP in a feature:
==========================
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> my_snp=4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
... if my_snp in feature:
... print("%s %s" % (feature.type, feature.qualifiers.get('db_xref')))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']
>>>
You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not
Extracting features:
====================
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC
>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13
>>>
For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
We also have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object.
Any reference objects are stored as a list in the SeqRecord object’s annotations dictionary under the key “references”
Printing as FASTA:
==================
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Alphabet import generic_protein
>>> record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
... +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
... +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
... +"SSAC", generic_protein),id="gi|14150838|gb|AAK54648.1|AF376133_1",description="chalcone synthase [Cucumis sativus]")
>>> print(record.format("fasta"))
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>>>
Slicing a SeqRecord:
====================
We’re going to focus in on the pim gene, YP_pPCP05. If you have a look at the GenBank file directly you’ll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780. From looking at the file you can work out that these are the twelfth and thirteenth entries in the file, so in Python zero-based counting they are entries 11 and 12 in the features list
[uzi@quince-srv2 ~]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record=SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037'])
>>> len(record)
9609
>>> len(record.features)
41
>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
Key: db_xref, Value: ['GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)']
Key: product, Value: ['pesticin immunity protein']
Key: protein_id, Value: ['NP_995571.1']
Key: transl_table, Value: ['11']
Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH']
>>> sub_record=record[4300:4800]
>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[])
>>> len(sub_record)
500
>>> sub_record.features
[SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='gene'), SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='CDS')]
>>> sub_record.features[0]
SeqFeature(FeatureLocation(ExactPosition(42), ExactPosition(480), strand=1), type='gene')
>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
Key: db_xref, Value: ['GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
Key: codon_start, Value: ['1']
Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
Key: gene, Value: ['pim']
Key: locus_tag, Value: ['YP_pPCP05']
Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)']
Key: product, Value: ['pesticin immunity protein']
Key: protein_id, Value: ['NP_995571.1']
Key: transl_table, Value: ['11']
Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH']
>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
LOCUS NC_005816 500 bp DNA UNK 01-JAN-1980
DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial.
ACCESSION NC_005816
VERSION NC_005816.1
KEYWORDS .
SOURCE .
ORGANISM .
.
FEATURES Location/Qualifiers
gene 43..480
/db_xref="GeneID:2767712"
/locus_tag="YP_pPCP05"
/gene="pim"
CDS 43..480
/locus_tag="YP_pPCP05"
/codon_start=1
/product="pesticin immunity protein"
/transl_table=11
/note="similar to many previously sequenced pesticin
immunity protein entries of Yersinia pestis plasmid pPCP,
e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) ,
gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655|
emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1|
(AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)"
/db_xref="GI:45478716"
/db_xref="GeneID:2767712"
/translation="MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNS
LSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFS
LIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH"
/gene="pim"
/protein_id="NP_995571.1"
ORIGIN
1 ataaatagat tattccaaat aatttattta tgtaagaaca ggatgggagg gggaatgatc
61 tcaaagttat tttgcttggc tctcatattt ttatcatcaa gtggccttgc agaaaaaaac
121 acatatacag caaaagacat cttgcaaaac ctagaattaa atacctttgg caattcattg
181 tctcatggca tctatgggaa acagacaacc ttcaagcaaa ccgagtttac aaatattaaa
241 agcaacacca aaaaacacat tgcacttatc aataaagaca actcatggat gatatcatta
301 aaaatactag gaattaagag agatgagtat actgtctgtt ttgaagattt ctctctaata
361 agaccgccaa catatgtagc catacatcct ctacttataa aaaaagtaaa atctggaaac
421 tttatagtag tgaaagaaat aaagaaatct atccctggtt gcactgtata ttatcattaa
481 tagcaagccc ctcattatta
//
>>>
Adding SeqRecord objects:
=========================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record = next(SeqIO.parse("example.fastq", "fastq"))
>>> len(record)
25
>>> print(record.seq)
CCCTTCTTGTCTTCAGCGTTTCTCC
>>> print(record.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 26, 23, 23]
>>> edited=record[:20]+record[21:]
>>> print(edited.seq)
CCCTTCTTGTCTTCAGCGTTCTCC
>>> print(edited.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26, 26, 26, 23, 23]
>>>
Shifting circular genome:
=========================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb","genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=['Project:58037'])
>>> len(record)
9609
>>> len(record.features)
41
>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references', 'accessions', 'data_file_division', 'date', 'organism', 'gi']
>>> shifted = record[2000:]+record[:2000]
>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA', IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.', dbxrefs=[])
>>> len(shifted)
9609
>>> len(shifted.features)
40
>>> shifted.dbxrefs
[]
>>> shifted.annotations.keys()
[]
>>> shifted.dbxrefs=record.dbxrefs[:]
>>> shifted.annotations=record.annotations.copy()
>>> shifted.dbxrefs
['Project:58037']
>>> shifted.annotations
{'comment': 'PROVISIONAL REFSEQ: This record has not yet been subject to final\nNCBI review. The reference sequence was derived from AE017046.\nCOMPLETENESS: full length.', 'date': '21-JUL-2008', 'taxonomy': ['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia'], 'gi': '45478711', 'source': 'Yersinia pestis biovar Microtus str. 91001', 'references': [Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)], 'accessions': ['NC_005816'], 'data_file_division': 'BCT', 'keywords': [''], 'organism': 'Yersinia pestis biovar Microtus str. 91001', 'sequence_version': 1}
>>>
Reverse-complementing SeqRecord object:
=======================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> print("%s %i %i %i %i" % (record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations)))
NC_005816.1 9609 41 1 11
>>> rc = record.reverse_complement(id="TESTING")
>>> print("%s %i %i %i %i" % (rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)))
TESTING 9609 41 0 0
>>>
List comprehension on SeqRecord:
================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> identifiers = [seq_record.id for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> identifiers
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', 'Z78526.1', 'Z78525.1', 'Z78524.1', 'Z78523.1', 'Z78522.1', 'Z78521.1', 'Z78520.1', 'Z78519.1', 'Z78518.1', 'Z78517.1', 'Z78516.1', 'Z78515.1', 'Z78514.1', 'Z78513.1', 'Z78512.1', 'Z78511.1', 'Z78510.1', 'Z78509.1', 'Z78508.1', 'Z78507.1', 'Z78506.1', 'Z78505.1', 'Z78504.1', 'Z78503.1', 'Z78502.1', 'Z78501.1', 'Z78500.1', 'Z78499.1', 'Z78498.1', 'Z78497.1', 'Z78496.1', 'Z78495.1', 'Z78494.1', 'Z78493.1', 'Z78492.1', 'Z78491.1', 'Z78490.1', 'Z78489.1', 'Z78488.1', 'Z78487.1', 'Z78486.1', 'Z78485.1', 'Z78484.1', 'Z78483.1', 'Z78482.1', 'Z78481.1', 'Z78480.1', 'Z78479.1', 'Z78478.1', 'Z78477.1', 'Z78476.1', 'Z78475.1', 'Z78474.1', 'Z78473.1', 'Z78472.1', 'Z78471.1', 'Z78470.1', 'Z78469.1', 'Z78468.1', 'Z78467.1', 'Z78466.1', 'Z78465.1', 'Z78464.1', 'Z78463.1', 'Z78462.1', 'Z78461.1', 'Z78460.1', 'Z78459.1', 'Z78458.1', 'Z78457.1', 'Z78456.1', 'Z78455.1', 'Z78454.1', 'Z78453.1', 'Z78452.1', 'Z78451.1', 'Z78450.1', 'Z78449.1', 'Z78448.1', 'Z78447.1', 'Z78446.1', 'Z78445.1', 'Z78444.1', 'Z78443.1', 'Z78442.1', 'Z78441.1', 'Z78440.1', 'Z78439.1']
>>> all_species = [seq_record.annotations["organism"] for seq_record in \
... SeqIO.parse("ls_orchid.gbk", "genbank")]
>>> print(all_species[1:5])
['Cypripedium californicum', 'Cypripedium fasciculatum', 'Cypripedium margaritaceum', 'Cypripedium lichiangense']
>>> import gzip
>>> handle = gzip.open("ls_orchid.gbk.gz", "r")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()
>>> import bz2
>>> handle = bz2.BZ2File("ls_orchid.gbk.bz2", "r")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()
>>> records = (rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement") \
... for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec)<700)
>>> SeqIO.write(records, "rev_comp.fasta", "fasta")
18
Manually iterating through SeqIO.parse:
=======================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")
>>> first_record = next(record_iterator)
>>> print(first_record.id)
gi|2765658|emb|Z78533.1|CIZ78533
>>> second_record = next(record_iterator)
>>> print(second_record.id)
gi|2765657|emb|Z78532.1|CCZ78532
>>>
Getting a list of records from SeqIO:
=====================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))
>>> print("Found %i records" % len(records))
Found 94 records
>>> print(records[-1].id)
Z78439.1
>>> print(records[0].id)
Z78533.1
>>>
Searching DBs online:
=====================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "A.N.Other@example.com"
>>> handle = Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291")
>>> seq_record = SeqIO.read(handle, "fasta")
>>> handle.close()
>>> print("%s with %i features" % (seq_record.id, len(seq_record.features)))
gi|6273291|gb|AF191665.1|AF191665 with 0 features
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291")
>>> seq_record = SeqIO.read(handle, "gb") #using "gb" as an alias for "genbank"
>>> handle.close()
>>> print("%s with %i features" % (seq_record.id, len(seq_record.features)))
AF191665.1 with 3 features
>>> handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", \
... id="6273291,6273290,6273289")
>>> for seq_record in SeqIO.parse(handle, "gb"):
... print seq_record.id, seq_record.description[:50] + "..."
... print "Sequence length %i," % len(seq_record),
... print "%i features," % len(seq_record.features),
... print "from: %s" % seq_record.annotations["source"]
...
AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Corynopuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Grusonia bradtiana
>>> handle.close()
>>> from Bio import ExPASy
>>> handle = ExPASy.get_sprot_raw("O23729")
>>> seq_record = SeqIO.read(handle, "swiss")
>>> handle.close()
>>> print(seq_record.id)
O23729
>>> print(seq_record.name)
CHS3_BROFI
>>> print(seq_record.description)
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
>>> print(repr(seq_record.seq))
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
>>> print("Length %i" % len(seq_record))
Length 394
>>> print(seq_record.annotations["keywords"])
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
>>>
Sequence files as dictionaries:
===============================
Bio.SeqIO.to_dict() --> This is basically a helper function to build a normal Python dictionary with each entry held as a SeqRecord object in memory, allowing you to modify the records.
Bio.SeqIO.index() --> Useful middle ground, acting like a read only dictionary and parsing sequences into SeqRecord objects on demand
Bio.SeqIO.index_db() --> Acts like a read only dictionary but stores the identifiers and file offsets in a file on disk (as an SQLite3 database), meaning it has very low memory requirements
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"))
>>> len(orchid_dict)
94
>>> orchid_dict.keys()[1:10]
['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', 'Z78441.1', 'Z78444.1', 'Z78529.1', 'Z78505.1']
>>> orchid_dict.values()[1:3]
[SeqRecord(seq=Seq('CGTAACAAGGTTTCCGTAGGTGAGCGGAAGGGTCATTGTTGAGATCACATAATA...AGC', IUPACAmbiguousDNA()), id='Z78464.1', name='Z78464', description='P.glanduliferum 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[]), SeqRecord(seq=Seq('CGTAACCAGGTTTCCGTAGGTGGACCTTCGGGAGGATCATTTTTGAGATCACAT...GCA', IUPACAmbiguousDNA()), id='Z78455.1', name='Z78455', description='P.javanicum 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[])]
>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record)
ID: Z78475.1
Name: Z78475
Description: P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Paphiopedilum supardii
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]
/accessions=['Z78475']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum supardii
/gi=2765600
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"))
>>> print(orchid_dict.keys()[1:5])
['gi|2765646|emb|Z78521.1|CCZ78521', 'gi|2765576|emb|Z78451.1|PHZ78451', 'gi|2765577|emb|Z78452.1|PBZ78452', 'gi|2765586|emb|Z78461.1|PWZ78461']
>>> def get_accession(record):
... """"Given a SeqRecord, return the accession number as a string.
...
... e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
... """
... parts = record.id.split("|")
... assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
... return parts[3]
...
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"), key_function=get_accession)
>>> print(orchid_dict.keys()[1:5])
['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1']
>>> from Bio.SeqUtils.CheckSum import seguid
>>> seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"),
... lambda rec : seguid(rec.seq))
>>> seguid_dict["4UjNimWz3PcmI0F0BOC5IksDrOg"]
SeqRecord(seq=Seq('GGAAGGTCATTGCCGATATCACATAATAATTGATCGAGTTAATCTGGAGGATCT...GAG', IUPACAmbiguousDNA()), id='Z78441.1', name='Z78441', description='P.superbiens 5.8S rRNA gene and ITS1 and ITS2 DNA.', dbxrefs=[])
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> orchid_dict.keys()[1:5]
['Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1']
>>> orchid_dict["Z78475.1"].seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
>>>
Getting the raw data for a record:
==================================
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
>>> handle = open("selected.dat", "w")
>>> for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
... handle.write(uniprot.get_raw(acc))
...
>>> handle.close()
>>> selected=SeqIO.index("selected.dat","swiss")
>>> selected
SeqIO.index('selected.dat', 'swiss', alphabet=None, key_function=None)
>>> selected.keys()
['P19801', 'Q9TRC7', 'P33487', 'Q8JZQ5', 'P13689']
Indexing in sqlite3 database:
=============================
ftp://ftp.ncbi.nih.gov/genbank/
get all the viral sequences gbvrl1.seq,gbvl2.seq, ....
and store that in a subfolder viral
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> files = ["viral/gbvrl%i.seq" % (i+1) for i in range(29)]
>>> gb_vrl = SeqIO.index_db("gbvrl.idx", files, "genbank")
>>> print("%i sequences indexed" % len(gb_vrl))
1507726 sequences indexed
>>> print(gb_vrl["GQ333173.1"].description)
HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial cds.
>>> print(gb_vrl.get_raw("GQ333173.1"))
LOCUS GQ333173 459 bp DNA linear VRL 21-OCT-2009
DEFINITION HIV-1 isolate F12279A1 from Uganda gag protein (gag) gene, partial
cds.
ACCESSION GQ333173
VERSION GQ333173.1 GI:254928714
KEYWORDS .
SOURCE Human immunodeficiency virus 1 (HIV-1)
ORGANISM Human immunodeficiency virus 1
Viruses; Retro-transcribing viruses; Retroviridae;
Orthoretrovirinae; Lentivirus; Primate lentivirus group.
REFERENCE 1 (bases 1 to 459)
AUTHORS Gray,R.R., Tatem,A.J., Lamers,S., Hou,W., Laeyendecker,O.,
Serwadda,D., Sewankambo,N., Gray,R.H., Wawer,M., Quinn,T.C.,
Goodenow,M.M. and Salemi,M.
TITLE Spatial phylodynamics of HIV-1 epidemic emergence in east Africa
JOURNAL AIDS 23 (14), F9-F17 (2009)
PUBMED 19644346
REFERENCE 2 (bases 1 to 459)
AUTHORS Collinson-Streng,A.N., Redd,A.D., Sewankambo,N.K., Serwadda,D.,
Rezapour,M., Lamers,S.L., Gray,R.H., Wawer,M.J., Quinn,T.C. and
Laeyendecker,O.
TITLE Geographic HIV Type 1 Subtype Distribution in Rakai District,
Uganda
JOURNAL AIDS Res. Hum. Retroviruses 25 (10), 1045-1048 (2009)
PUBMED 19803713
REFERENCE 3 (bases 1 to 459)
AUTHORS Collinson-Streng,A., Redd,A.D., Sewankambo,N.K., Serwadda,D.,
Rezapour,M., Lamers,S.L., Gray,R.H., Wawer,M.J., Quinn,T.C. and
Laeyendecker,O.
TITLE Direct Submission
JOURNAL Submitted (26-JUN-2009) National Institutes of Health, National
Institute of Allergy and Infectious Diseases, Baltimore, MD 21205,
USA
FEATURES Location/Qualifiers
source 1..459
/organism="Human immunodeficiency virus 1"
/proviral
/mol_type="genomic DNA"
/isolate="F12279A1"
/host="Homo sapiens"
/db_xref="taxon:11676"
/country="Uganda:Kasasa-Sanje"
/collection_date="19-Jun-1995"
/note="subtype: D"
gene <1..>459
/gene="gag"
CDS <1..>459
/gene="gag"
/note="p24"
/codon_start=1
/product="gag protein"
/protein_id="ACT86536.1"
/db_xref="GI:254928715"
/translation="LNAWVKVIEEKAFSPEVIPMFSALSEGATPQDLNTMLNTVGGHQ
AAMQMLKETINEEAAEWDRLHPVHAGPIAPGQMREPRGSDIAGTTSTLQEQIGWMTNN
PPIPVGEIYKRWIILGLNKIVRMYSPVSILDIRQGPKEPFRDYVDRFYKTL"
ORIGIN
1 ttgaatgcat gggtaaaagt aatagaggag aaggctttca gcccagaagt aatacccatg
61 ttttcagcat tatcagaagg agccacccca caagatttaa acaccatgct aaacacagtg
121 gggggacatc aagcagctat gcaaatgtta aaagagacaa tcaatgagga agctgcagaa
181 tgggataggc tacatccagt acatgcaggg cctattgcac caggccaaat gagagaacct
241 aggggaagtg atatagcagg aactactagt acccttcagg aacaaatagg atggatgaca
301 aacaatccac ctatcccagt aggagaaatc tataaaagat ggataatcct gggattaaat
361 aaaatagtaa gaatgtatag ccctgtcagc attttggaca taagacaagg accaaaggaa
421 ccctttagag actatgtaga tcggttctat aaaacttta
//
>>>
Compressing and loading files (Blocked GNU zip format):
=======================================================
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
$ bgzip -c ls_orchid.gbk > ls_orchid.gbk.bgz
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index_db("ls_orchid.gbk.bgz.idx", "ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
Writing multiple records:
=========================
Say rec1, rec2, and rec3 are all SeqRecords
then,
my_records=[rec1,rec2,rec3]
and use
from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")
To convert between different sequence formats
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print("Converted %i records" % count)
or
from Bio import SeqIO
count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")
print("Converted %i records" % count)
Formatted string from SeqRecord:
================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> from StringIO import StringIO
>>> records = SeqIO.parse("ls_orchid.gbk", "genbank")
>>> out_handle = StringIO()
>>> SeqIO.write(records, out_handle, "fasta")
94
>>> fasta_data = out_handle.getvalue()
>>> fasta_data[0:200]
'>Z78533.1 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAA\nCGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGT\nGACCCTGATTTGTTGTTG'
>>> out_handle = open("ls_orchid_long.tab", "w")
>>> for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
... if len(record) > 100:
... out_handle.write(record.format("tab"))
...
>>> out_handle.close()
[uzi@quince-srv2 ~/biopython]$ head -2 ls_orchid_long.tab
Z78533.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC
Z78532.1 CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC
Bio.AlignIO:
============
Bio.AlignIO.parse() --> Will return an iterator which gives MultipleSeqAlignment objects
Bio.AlignIO.read() --> Returns a single MultipleSeqAlignment object
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
>>> print("Alignment length %i" % alignment.get_alignment_length())
Alignment length 52
>>> for record in alignment:
... print("%s - %s" % (record.seq, record.id))
...
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
>>> for record in alignment:
... if record.dbxrefs:
... print("%s %s" % (record.id, record.dbxrefs))
...
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']
>>> for record in alignment:
... print(record)
...
ID: COATB_BPIKE/30-81
Name: COATB_BPIKE
Description: COATB_BPIKE/30-81
Database cross-references: PDB; 1ifl ; 1-52;
Number of features: 0
/start=30
/end=81
/accession=P03620.1
Per letter annotation for: secondary_structure
Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA', SingleLetterAlphabet())
ID: Q9T0Q8_BPIKE/1-52
Name: Q9T0Q8_BPIKE
Description: Q9T0Q8_BPIKE/1-52
Number of features: 0
/start=1
/end=52
/accession=Q9T0Q8.1
Seq('AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA', SingleLetterAlphabet())
ID: COATB_BPI22/32-83
Name: COATB_BPI22
Description: COATB_BPI22/32-83
Number of features: 0
/start=32
/end=83
/accession=P15416.1
Seq('DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA', SingleLetterAlphabet())
ID: COATB_BPM13/24-72
Name: COATB_BPM13
Description: COATB_BPM13/24-72
Database cross-references: PDB; 2cpb ; 1-49;, PDB; 2cps ; 1-49;
Number of features: 0
/start=24
/end=72
/accession=P69541.1
Per letter annotation for: secondary_structure
Seq('AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA', SingleLetterAlphabet())
ID: COATB_BPZJ2/1-49
Name: COATB_BPZJ2
Description: COATB_BPZJ2/1-49
Number of features: 0
/start=1
/end=49
/accession=P03618.1
Seq('AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA', SingleLetterAlphabet())
ID: Q9T0Q9_BPFD/1-49
Name: Q9T0Q9_BPFD
Description: Q9T0Q9_BPFD/1-49
Database cross-references: PDB; 1nh4 A; 1-49;
Number of features: 0
/start=1
/end=49
/accession=Q9T0Q9.1
Per letter annotation for: secondary_structure
Seq('AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA', SingleLetterAlphabet())
ID: COATB_BPIF1/22-73
Name: COATB_BPIF1
Description: COATB_BPIF1/22-73
Database cross-references: PDB; 1ifk ; 1-50;
Number of features: 0
/start=22
/end=73
/accession=P03619.2
Per letter annotation for: secondary_structure
Seq('FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA', SingleLetterAlphabet())
>>> alignment = AlignIO.read("PF05371_seed.faa", "fasta")
>>> print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA CAPSD_BPM13/24-72
AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA CAPSD_BPZJ2/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA CAPSD_BPIF1/22-73
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA CAPSD_BPI22/32-83
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA CAPSD_BPIKE/30-81
>>>
Multiple pair-wise alignments:
==============================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import AlignIO
>>> for alignment in AlignIO.parse("multiplepairwise.fasta", "fasta", seq_count=2):
... print("Alignment length %i" % alignment.get_alignment_length())
... for record in alignment:
... print("%s - %s" % (record.seq, record.id))
... print("")
...
Alignment length 19
ACTACGACTAGCTCAG--G - Alpha
ACTACCGCTAGCTCAGAAG - XXX
Alignment length 17
ACTACGACTAGCTCAGG - Alpha
ACTACGGCAAGCACAGG - YYY
Alignment length 21
--ACTACGAC--TAGCTCAGG - Alpha
GGACTACGACAATAGCTCAGG - ZZZ
>>>
Writing multiple alignments:
============================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Alphabet import generic_dna
>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> from Bio.Align import MultipleSeqAlignment
>>> align1 = MultipleSeqAlignment([
... SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
... SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
... SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
... ])
>>> align2 = MultipleSeqAlignment([
... SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
... SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
... SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
... ])
>>> align3 = MultipleSeqAlignment([
... SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
... SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
... SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
... ])
>>> my_alignments = [align1, align2, align3]
>>> from Bio import AlignIO
>>> AlignIO.write(my_alignments, "my_example.phy", "phylip")
3
>>>
[uzi@quince-srv2 ~/biopython]$ cat my_example.phy
3 12
Alpha ACTGCTAGCT AG
Beta ACT-CTAGCT AG
Gamma ACTGCTAGDT AG
3 9
Delta GTCAGC-AG
Epsilon GACAGCTAG
Zeta GTCAGCTAG
3 13
Eta ACTAGTACAG CTG
Theta ACTAGTACAG CT-
Iota -CTACTACAG GTG
[uzi@quince-srv2 ~/biopython]$
Converting between sequence alignment format:
=============================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import AlignIO
>>> alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
>>> count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal")
>>> print("Converted %i alignments" % count)
Converted 1 alignments
>>> count = AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal")
>>> print("Converted %i alignments" % count)
Converted 1 alignments
>>> AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip")
1
>>>
[uzi@quince-srv2 ~/biopython]$ cat PF05371_seed.aln
CLUSTAL X (1.81) multiple sequence alignment
COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
COATB_BPIKE/30-81 KA
Q9T0Q8_BPIKE/1-52 RA
COATB_BPI22/32-83 KA
COATB_BPM13/24-72 KA
COATB_BPZJ2/1-49 KA
Q9T0Q9_BPFD/1-49 KA
COATB_BPIF1/22-73 RA
[uzi@quince-srv2 ~/biopython]$ cat PF05371_seed.phy
7 52
COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS
KA
RA
KA
KA
KA
KA
RA
[uzi@quince-srv2 ~/biopython]$
Multiple alignments as formatted strings:
=========================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print(alignment.format("clustal"))
CLUSTAL X (1.81) multiple sequence alignment
COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS
COATB_BPIKE/30-81 KA
Q9T0Q8_BPIKE/1-52 RA
COATB_BPI22/32-83 KA
COATB_BPM13/24-72 KA
COATB_BPZJ2/1-49 KA
Q9T0Q9_BPFD/1-49 KA
COATB_BPIF1/22-73 RA
>>> from StringIO import StringIO
>>>
>>> alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
>>>
>>> out_handle = StringIO()
>>> AlignIO.write(alignments, out_handle, "clustal")
1
>>> clustal_data = out_handle.getvalue()
>>>
Locally installed multiple alignment tools:
===========================================
>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
...
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline',
'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ...]
The module Bio.Emboss.Applications has wrappers for some of the EMBOSS suite, including needle and water
Running locally installed clustalw2:
====================================
/home/opt/clustalw-2.1/src/clustalw2
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exec = "/home/opt/clustalw-2.1/src/clustalw2"
>>> clustalw_cline = ClustalwCommandline(clustalw_exec, infile="opuntia.fasta")
>>> assert os.path.isfile(clustalw_exec), "Clustal W executable missing"
>>> stdout, stderr = clustalw_cline()
>>> print(stdout)
CLUSTAL 2.1 Multiple Sequence Alignments
Sequence format is Pearson
Sequence 1: gi|6273291|gb|AF191665.1|AF191665 902 bp
Sequence 2: gi|6273290|gb|AF191664.1|AF191664 899 bp
Sequence 3: gi|6273289|gb|AF191663.1|AF191663 899 bp
Sequence 4: gi|6273287|gb|AF191661.1|AF191661 895 bp
Sequence 5: gi|6273286|gb|AF191660.1|AF191660 893 bp
Sequence 6: gi|6273285|gb|AF191659.1|AF191659 894 bp
Sequence 7: gi|6273284|gb|AF191658.1|AF191658 896 bp
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 99
Sequences (1:3) Aligned. Score: 99
Sequences (1:4) Aligned. Score: 98
Sequences (1:5) Aligned. Score: 98
Sequences (1:6) Aligned. Score: 98
Sequences (1:7) Aligned. Score: 98
Sequences (2:3) Aligned. Score: 99
Sequences (2:4) Aligned. Score: 98
Sequences (2:5) Aligned. Score: 98
Sequences (2:6) Aligned. Score: 98
Sequences (2:7) Aligned. Score: 98
Sequences (3:4) Aligned. Score: 98
Sequences (3:5) Aligned. Score: 98
Sequences (3:6) Aligned. Score: 98
Sequences (3:7) Aligned. Score: 98
Sequences (4:5) Aligned. Score: 99
Sequences (4:6) Aligned. Score: 99
Sequences (4:7) Aligned. Score: 99
Sequences (5:6) Aligned. Score: 99
Sequences (5:7) Aligned. Score: 99
Sequences (6:7) Aligned. Score: 99
Guide tree file created: [opuntia.dnd]
There are 6 groups
Start of Multiple Alignment
Aligning...
Group 1: Sequences: 2 Score:16933
Group 2: Sequences: 2 Score:16703
Group 3: Sequences: 4 Score:16812
Group 4: Sequences: 2 Score:17071
Group 5: Sequences: 3 Score:16845
Group 6: Sequences: 7 Score:16678
Alignment Score 114256
CLUSTAL-Alignment file created [opuntia.aln]
>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191
>>> from Bio import Phylo
>>> tree=Phylo.read("opuntia.dnd","newick")
>>> Phylo.draw_ascii(tree)
, gi|6273291|gb|AF191665.1|AF191665
|
, gi|6273290|gb|AF191664.1|AF191664
|
| gi|6273289|gb|AF191663.1|AF191663
|
___________________________________________, gi|6273287|gb|AF191661.1|AF191661
|
| gi|6273286|gb|AF191660.1|AF191660
|
, gi|6273285|gb|AF191659.1|AF191659
|
| gi|6273284|gb|AF191658.1|AF191658
>>>
[uzi@quince-srv2 ~/biopython]$ cat opuntia.dnd
(
(
gi|6273291|gb|AF191665.1|AF191665:0.00418,
(
gi|6273290|gb|AF191664.1|AF191664:0.00189,
gi|6273289|gb|AF191663.1|AF191663:0.00145)
:0.00083)
:0.00770,
(
gi|6273287|gb|AF191661.1|AF191661:0.00489,
gi|6273286|gb|AF191660.1|AF191660:0.00295)
:0.00014,
(
gi|6273285|gb|AF191659.1|AF191659:0.00094,
gi|6273284|gb|AF191658.1|AF191658:0.00018)
:0.00125);
[uzi@quince-srv2 ~/biopython]$
Running locally installed MUSCLE:
=================================
/home/opt/qiime_software/muscle-3.8.31-release/muscle
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Align.Applications import MuscleCommandline
>>> muscle_exec="/home/opt/qiime_software/muscle-3.8.31-release/muscle"
>>> import os
>>> assert os.path.isfile(muscle_exec), "muscle executable missing"
>>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt")
>>> print(cline)
/home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt
>>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt", clw=True)
>>> print(cline)
/home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt -clw
>>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta",out="opuntia.txt", clwstrict=True)
>>> print(cline)
/home/opt/qiime_software/muscle-3.8.31-release/muscle -in opuntia.fasta -out opuntia.txt -clwstrict
WHEN NOT SPECIFYING ANY OUTPUT
>>> cline=MuscleCommandline(muscle_exec,input="opuntia.fasta")
>>> stdout,stderr=cline()
>>> from StringIO import StringIO
>>> from Bio import AlignIO
>>> align = AlignIO.read(StringIO(stdout),"fasta")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658
>>> import subprocess
>>> import sys
>>> child=subprocess.Popen(str(cline),stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=(sys.platform!="win32"))
>>> align=AlignIO.read(child.stdout,"fasta")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658
WHEN NOT SPECIFYING ANY INPUT OR OUTPUT
>>> from Bio import SeqIO
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)
>>> cline=MuscleCommandline(clwstrict=True)
>>> cline=MuscleCommandline(muscle_exec,clwstrict=True)
>>> print(cline)
/home/opt/qiime_software/muscle-3.8.31-release/muscle -clwstrict
>>> child=subprocess.Popen(str(cline),stdin=subprocess.PIPE,stdout=subprocess.PIPE,stderr=subprocess.PIPE,shell=(sys.platform!="win32"))
>>> SeqIO.write(records,child.stdin,"fasta")
6
>>> child.stdin.close()
>>> align=AlignIO.read(child.stdout,"clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165
WHEN NOT SPECIFYING ANY INPUT OR OUTPUT (NOT USING SUBPROCESS)
>>> handle=StringIO()
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)
>>> SeqIO.write(records,handle,"fasta")
6
>>> data=handle.getvalue()
>>> print(cline)
/home/opt/qiime_software/muscle-3.8.31-release/muscle -clwstrict
>>> stdout,stderr=cline(stdin=data)
>>> align=AlignIO.read(StringIO(stdout),"clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165
>>>
Running locally installed EMBOSS tools:
=======================================
The EMBOSS suite includes the water and needle tools for Smith-Waterman algorithm local alignment, and Needleman-Wunsch global alignment.
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Emboss.Applications import NeedleCommandline
>>> needle_cline=NeedleCommandline("needle",asequence="alpha.fasta",bsequence="beta.fasta",gapopen=10,gapextend=0.5,outfile="needle.txt")
>>> print(needle_cline)
needle -outfile=needle.txt -asequence=alpha.fasta -bsequence=beta.fasta -gapopen=10 -gapextend=0.5
>>> stdout,stderr=needle_cline()
>>> print(stdout+stderr)
Needleman-Wunsch global alignment of two sequences
>>> from Bio import AlignIO
>>> align=AlignIO.read("needle.txt","emboss")
>>> print(align)
SingleLetterAlphabet() alignment with 2 rows and 149 columns
MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR HBA_HUMAN
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRF...KYH HBB_HUMAN
>>>
if you don’t want a temporary output file to get rid of, use stdout=True rather than the outfile argument and also to read one of the one of the inputs from stdin, use asequence="stdin"
Running BLAST online:
=====================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
>>> from Bio import SeqIO
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
>>> save_file = open("my_blast.xml", "w")
>>> save_file.write(result_handle.read())
>>> save_file.close()
>>> result_handle.close()
Note you can only read result_handle.read() once. Next time if you read it, it will return empty string
Running BLAST locally:
======================
/home/opt/ncbi-blast-2.2.28+/bin/blastx
/home/opt/ncbi-blast-2.2.28+/db/nr
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> blastx_exec="/home/opt/ncbi-blast-2.2.28+/bin/blastx"
>>> import os
>>> assert os.path.isfile(blastx_exec), "blastx executable missing"
>>> blastx_cline = NcbiblastxCommandline(blastx_exec,query="opuntia.fasta", db="/home/opt/ncbi-blast-2.2.28+/db/nr", evalue=0.001,outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='/home/opt/ncbi-blast-2.2.28+/bin/blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta', db='/home/opt/ncbi-blast-2.2.28+/db/nr', evalue=0.001)
>>> print(blastx_cline)
/home/opt/ncbi-blast-2.2.28+/bin/blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db /home/opt/ncbi-blast-2.2.28+/db/nr -evalue 0.001
>>> stdout, stderr = blastx_cline()
>>>
Parsing BLAST records:
======================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast import NCBIXML
>>> result_handle=open("my_blast.xml")
>>> blast_record=NCBIXML.read(result_handle)
>>> result_handle.close()
>>> blast_record
>>> result_handle=open("opuntia.xml")
>>> blast_records=NCBIXML.parse(result_handle)
>>> result_handle.close()
>>> blast_records
>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
... for hsp in alignment.hsps:
... if hsp.expect < E_VALUE_THRESH:
... print('****Alignment****')
... print('sequence:'+ alignment.title)
... print('length:' + alignment.length)
... print('e value:' + hsp.expect)
... print(hsp.query[0:75] + '...')
... print(hsp.match[0:75] + '...')
... print(hsp.sbjct[0:75] + '...')
****Alignment****
sequence:gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA
length:878
e value:3.4368500000000001e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT...
****Alignment****
sequence:gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X4, mRNA
length:911
e value:3.4368500000000001e-99
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAATATGATCGATTCCGATA...
||||||||||| |||| || ||||| |||||||||||| | || | || | ||||| || ||| ...
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAGTTGATCAGCTCTGATT...
...
Experimental Bio.SearchIO:
==========================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> blast_qresult=SearchIO.read("my_blast.xml","blast-xml")
>>> print(blast_qresult)
Program: blastn (2.2.29+)
Query: 40259 (1111)
gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cryst...
Target: nt
Hits: ---- ----- ----------------------------------------------------------
# # HSP ID + description
---- ----- ----------------------------------------------------------
0 1 gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sin...
1 1 gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sin...
2 1 gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sin...
3 1 gi|568824601|ref|XM_006466623.1| PREDICTED: Citrus sin...
4 1 gi|568824599|ref|XM_006466622.1| PREDICTED: Citrus sin...
5 1 gi|567866318|ref|XM_006425719.1| Citrus clementina hyp...
6 1 gi|567866316|ref|XM_006425718.1| Citrus clementina hyp...
7 1 gi|567866314|ref|XM_006425717.1| Citrus clementina hyp...
8 1 gi|567866312|ref|XM_006425716.1| Citrus clementina hyp...
9 1 gi|224094601|ref|XM_002310151.1| Populus trichocarpa p...
10 1 gi|566180892|ref|XM_006380679.1| Populus trichocarpa c...
11 1 gi|470129329|ref|XM_004300526.1| PREDICTED: Fragaria v...
12 1 gi|359495761|ref|XM_002274845.2| PREDICTED: Vitis vini...
13 1 gi|349709091|emb|FQ378501.1| Vitis vinifera clone SS0A...
14 1 gi|255562758|ref|XM_002522339.1| Ricinus communis COR4...
15 1 gi|565355903|ref|XM_006344753.1| PREDICTED: Solanum tu...
16 1 gi|502153055|ref|XM_004509143.1| PREDICTED: Cicer arie...
17 1 gi|358346403|ref|XM_003637210.1| Medicago truncatula C...
18 1 gi|358344000|ref|XM_003636035.1| Medicago truncatula C...
19 1 gi|571531434|ref|XM_003548859.2| PREDICTED: Glycine ma...
20 1 gi|449448541|ref|XM_004141977.1| PREDICTED: Cucumis sa...
21 1 gi|571438949|ref|XM_003519866.2| PREDICTED: Glycine ma...
22 1 gi|460375241|ref|XM_004233368.1| PREDICTED: Solanum ly...
23 1 gi|225311746|dbj|AK326681.1| Solanum lycopersicum cDNA...
24 1 gi|255762732|gb|GQ370517.1| Salvia miltiorrhiza cold a...
25 1 gi|225428595|ref|XM_002284686.1| PREDICTED: Vitis vini...
26 1 gi|568851338|ref|XM_006479288.1| PREDICTED: Citrus sin...
27 1 gi|567861017|ref|XM_006423100.1| Citrus clementina hyp...
28 1 gi|297819785|ref|XM_002877730.1| Arabidopsis lyrata su...
29 1 gi|86755971|gb|DQ359747.1| Chimonanthus praecox cold a...
~~~
47 1 gi|242389633|emb|FP100664.1| Phyllostachys edulis cDNA...
48 1 gi|242382816|emb|FP092058.1| Phyllostachys edulis cDNA...
49 1 gi|573926239|ref|XM_006650591.1| PREDICTED: Oryza brac...
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> print(blat_qresult)
Program: blat ()
Query: mystery_seq (61)
Target:
Hits: ---- ----- ----------------------------------------------------------
# # HSP ID + description
---- ----- ----------------------------------------------------------
0 17 chr19
>>> dir(blast_qresult)
['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version']
>>> for hit in blast_qresult:
... hit
...
Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps)
Hit(id='gi|568824605|ref|XM_006466625.1|', query_id='40259', 1 hsps)
Hit(id='gi|568824603|ref|XM_006466624.1|', query_id='40259', 1 hsps)
Hit(id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 hsps)
Hit(id='gi|568824599|ref|XM_006466622.1|', query_id='40259', 1 hsps)
Hit(id='gi|567866318|ref|XM_006425719.1|', query_id='40259', 1 hsps)
Hit(id='gi|567866316|ref|XM_006425718.1|', query_id='40259', 1 hsps)
Hit(id='gi|567866314|ref|XM_006425717.1|', query_id='40259', 1 hsps)
Hit(id='gi|567866312|ref|XM_006425716.1|', query_id='40259', 1 hsps)
Hit(id='gi|224094601|ref|XM_002310151.1|', query_id='40259', 1 hsps)
Hit(id='gi|566180892|ref|XM_006380679.1|', query_id='40259', 1 hsps)
Hit(id='gi|470129329|ref|XM_004300526.1|', query_id='40259', 1 hsps)
Hit(id='gi|359495761|ref|XM_002274845.2|', query_id='40259', 1 hsps)
Hit(id='gi|349709091|emb|FQ378501.1|', query_id='40259', 1 hsps)
Hit(id='gi|255562758|ref|XM_002522339.1|', query_id='40259', 1 hsps)
Hit(id='gi|565355903|ref|XM_006344753.1|', query_id='40259', 1 hsps)
Hit(id='gi|502153055|ref|XM_004509143.1|', query_id='40259', 1 hsps)
Hit(id='gi|358346403|ref|XM_003637210.1|', query_id='40259', 1 hsps)
Hit(id='gi|358344000|ref|XM_003636035.1|', query_id='40259', 1 hsps)
Hit(id='gi|571531434|ref|XM_003548859.2|', query_id='40259', 1 hsps)
Hit(id='gi|449448541|ref|XM_004141977.1|', query_id='40259', 1 hsps)
Hit(id='gi|571438949|ref|XM_003519866.2|', query_id='40259', 1 hsps)
Hit(id='gi|460375241|ref|XM_004233368.1|', query_id='40259', 1 hsps)
Hit(id='gi|225311746|dbj|AK326681.1|', query_id='40259', 1 hsps)
Hit(id='gi|255762732|gb|GQ370517.1|', query_id='40259', 1 hsps)
Hit(id='gi|225428595|ref|XM_002284686.1|', query_id='40259', 1 hsps)
Hit(id='gi|568851338|ref|XM_006479288.1|', query_id='40259', 1 hsps)
Hit(id='gi|567861017|ref|XM_006423100.1|', query_id='40259', 1 hsps)
Hit(id='gi|297819785|ref|XM_002877730.1|', query_id='40259', 1 hsps)
Hit(id='gi|86755971|gb|DQ359747.1|', query_id='40259', 1 hsps)
Hit(id='gi|565467969|ref|XM_006291803.1|', query_id='40259', 1 hsps)
Hit(id='gi|145339339|ref|NM_114943.4|', query_id='40259', 1 hsps)
Hit(id='gi|15810634|gb|AY056356.1|', query_id='40259', 1 hsps)
Hit(id='gi|10121842|gb|AF283005.1|', query_id='40259', 1 hsps)
Hit(id='gi|13430785|gb|AF360305.1|', query_id='40259', 1 hsps)
Hit(id='gi|60317457|gb|AY761065.1|', query_id='40259', 1 hsps)
Hit(id='gi|255556172|ref|XM_002519075.1|', query_id='40259', 1 hsps)
Hit(id='gi|156567558|gb|EU077497.1|', query_id='40259', 1 hsps)
Hit(id='gi|567188085|ref|XM_006403931.1|', query_id='40259', 1 hsps)
Hit(id='gi|449502385|ref|XM_004161578.1|', query_id='40259', 1 hsps)
Hit(id='gi|449455211|ref|XM_004145299.1|', query_id='40259', 1 hsps)
Hit(id='gi|449474965|ref|XM_004154286.1|', query_id='40259', 1 hsps)
Hit(id='gi|46577795|gb|AY587773.1|', query_id='40259', 1 hsps)
Hit(id='gi|305690597|gb|HQ010041.1|', query_id='40259', 1 hsps)
Hit(id='gi|566187445|ref|XM_002313788.2|', query_id='40259', 1 hsps)
Hit(id='gi|470148635|ref|XM_004309813.1|', query_id='40259', 1 hsps)
Hit(id='gi|357528757|gb|GU370376.1|', query_id='40259', 1 hsps)
Hit(id='gi|242389633|emb|FP100664.1|', query_id='40259', 1 hsps)
Hit(id='gi|242382816|emb|FP092058.1|', query_id='40259', 1 hsps)
Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps)
>>> len(blast_qresult)
50
>>> blast_qresult[0]
Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps)
>>> blast_qresult[-1]
Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps)
>>> blast_slice=blast_qresult[:3]
>>> print(blast_slice)
Program: blastn (2.2.29+)
Query: 40259 (1111)
gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum cryst...
Target: nt
Hits: ---- ----- ----------------------------------------------------------
# # HSP ID + description
---- ----- ----------------------------------------------------------
0 1 gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sin...
1 1 gi|568824605|ref|XM_006466625.1| PREDICTED: Citrus sin...
2 1 gi|568824603|ref|XM_006466624.1| PREDICTED: Citrus sin...
>>> blast_qresult['gi|568824607|ref|XM_006466626.1|']
Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps)
>>> blast_qresult.hits
[Hit(id='gi|568824607|ref|XM_006466626.1|', query_id='40259', 1 hsps), Hit(id='gi|568824605|ref|XM_006466625.1|', query_id='40259', 1 hsps), Hit(id='gi|568824603|ref|XM_006466624.1|', query_id='40259', 1 hsps), Hit(id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 hsps), Hit(id='gi|568824599|ref|XM_006466622.1|', query_id='40259', 1 hsps), Hit(id='gi|567866318|ref|XM_006425719.1|', query_id='40259', 1 hsps), Hit(id='gi|567866316|ref|XM_006425718.1|', query_id='40259', 1 hsps), Hit(id='gi|567866314|ref|XM_006425717.1|', query_id='40259', 1 hsps), Hit(id='gi|567866312|ref|XM_006425716.1|', query_id='40259', 1 hsps), Hit(id='gi|224094601|ref|XM_002310151.1|', query_id='40259', 1 hsps), Hit(id='gi|566180892|ref|XM_006380679.1|', query_id='40259', 1 hsps), Hit(id='gi|470129329|ref|XM_004300526.1|', query_id='40259', 1 hsps), Hit(id='gi|359495761|ref|XM_002274845.2|', query_id='40259', 1 hsps), Hit(id='gi|349709091|emb|FQ378501.1|', query_id='40259', 1 hsps), Hit(id='gi|255562758|ref|XM_002522339.1|', query_id='40259', 1 hsps), Hit(id='gi|565355903|ref|XM_006344753.1|', query_id='40259', 1 hsps), Hit(id='gi|502153055|ref|XM_004509143.1|', query_id='40259', 1 hsps), Hit(id='gi|358346403|ref|XM_003637210.1|', query_id='40259', 1 hsps), Hit(id='gi|358344000|ref|XM_003636035.1|', query_id='40259', 1 hsps), Hit(id='gi|571531434|ref|XM_003548859.2|', query_id='40259', 1 hsps), Hit(id='gi|449448541|ref|XM_004141977.1|', query_id='40259', 1 hsps), Hit(id='gi|571438949|ref|XM_003519866.2|', query_id='40259', 1 hsps), Hit(id='gi|460375241|ref|XM_004233368.1|', query_id='40259', 1 hsps), Hit(id='gi|225311746|dbj|AK326681.1|', query_id='40259', 1 hsps), Hit(id='gi|255762732|gb|GQ370517.1|', query_id='40259', 1 hsps), Hit(id='gi|225428595|ref|XM_002284686.1|', query_id='40259', 1 hsps), Hit(id='gi|568851338|ref|XM_006479288.1|', query_id='40259', 1 hsps), Hit(id='gi|567861017|ref|XM_006423100.1|', query_id='40259', 1 hsps), Hit(id='gi|297819785|ref|XM_002877730.1|', query_id='40259', 1 hsps), Hit(id='gi|86755971|gb|DQ359747.1|', query_id='40259', 1 hsps), Hit(id='gi|565467969|ref|XM_006291803.1|', query_id='40259', 1 hsps), Hit(id='gi|145339339|ref|NM_114943.4|', query_id='40259', 1 hsps), Hit(id='gi|15810634|gb|AY056356.1|', query_id='40259', 1 hsps), Hit(id='gi|10121842|gb|AF283005.1|', query_id='40259', 1 hsps), Hit(id='gi|13430785|gb|AF360305.1|', query_id='40259', 1 hsps), Hit(id='gi|60317457|gb|AY761065.1|', query_id='40259', 1 hsps), Hit(id='gi|255556172|ref|XM_002519075.1|', query_id='40259', 1 hsps), Hit(id='gi|156567558|gb|EU077497.1|', query_id='40259', 1 hsps), Hit(id='gi|567188085|ref|XM_006403931.1|', query_id='40259', 1 hsps), Hit(id='gi|449502385|ref|XM_004161578.1|', query_id='40259', 1 hsps), Hit(id='gi|449455211|ref|XM_004145299.1|', query_id='40259', 1 hsps), Hit(id='gi|449474965|ref|XM_004154286.1|', query_id='40259', 1 hsps), Hit(id='gi|46577795|gb|AY587773.1|', query_id='40259', 1 hsps), Hit(id='gi|305690597|gb|HQ010041.1|', query_id='40259', 1 hsps), Hit(id='gi|566187445|ref|XM_002313788.2|', query_id='40259', 1 hsps), Hit(id='gi|470148635|ref|XM_004309813.1|', query_id='40259', 1 hsps), Hit(id='gi|357528757|gb|GU370376.1|', query_id='40259', 1 hsps), Hit(id='gi|242389633|emb|FP100664.1|', query_id='40259', 1 hsps), Hit(id='gi|242382816|emb|FP092058.1|', query_id='40259', 1 hsps), Hit(id='gi|573926239|ref|XM_006650591.1|', query_id='40259', 1 hsps)]
>>> blast_qresult.hit_keys
['gi|568824607|ref|XM_006466626.1|', 'gi|568824605|ref|XM_006466625.1|', 'gi|568824603|ref|XM_006466624.1|', 'gi|568824601|ref|XM_006466623.1|', 'gi|568824599|ref|XM_006466622.1|', 'gi|567866318|ref|XM_006425719.1|', 'gi|567866316|ref|XM_006425718.1|', 'gi|567866314|ref|XM_006425717.1|', 'gi|567866312|ref|XM_006425716.1|', 'gi|224094601|ref|XM_002310151.1|', 'gi|566180892|ref|XM_006380679.1|', 'gi|470129329|ref|XM_004300526.1|', 'gi|359495761|ref|XM_002274845.2|', 'gi|349709091|emb|FQ378501.1|', 'gi|255562758|ref|XM_002522339.1|', 'gi|565355903|ref|XM_006344753.1|', 'gi|502153055|ref|XM_004509143.1|', 'gi|358346403|ref|XM_003637210.1|', 'gi|358344000|ref|XM_003636035.1|', 'gi|571531434|ref|XM_003548859.2|', 'gi|449448541|ref|XM_004141977.1|', 'gi|571438949|ref|XM_003519866.2|', 'gi|460375241|ref|XM_004233368.1|', 'gi|225311746|dbj|AK326681.1|', 'gi|255762732|gb|GQ370517.1|', 'gi|225428595|ref|XM_002284686.1|', 'gi|568851338|ref|XM_006479288.1|', 'gi|567861017|ref|XM_006423100.1|', 'gi|297819785|ref|XM_002877730.1|', 'gi|86755971|gb|DQ359747.1|', 'gi|565467969|ref|XM_006291803.1|', 'gi|145339339|ref|NM_114943.4|', 'gi|15810634|gb|AY056356.1|', 'gi|10121842|gb|AF283005.1|', 'gi|13430785|gb|AF360305.1|', 'gi|60317457|gb|AY761065.1|', 'gi|255556172|ref|XM_002519075.1|', 'gi|156567558|gb|EU077497.1|', 'gi|567188085|ref|XM_006403931.1|', 'gi|449502385|ref|XM_004161578.1|', 'gi|449455211|ref|XM_004145299.1|', 'gi|449474965|ref|XM_004154286.1|', 'gi|46577795|gb|AY587773.1|', 'gi|305690597|gb|HQ010041.1|', 'gi|566187445|ref|XM_002313788.2|', 'gi|470148635|ref|XM_004309813.1|', 'gi|357528757|gb|GU370376.1|', 'gi|242389633|emb|FP100664.1|', 'gi|242382816|emb|FP092058.1|', 'gi|573926239|ref|XM_006650591.1|']
>>> 'gi|565467969|ref|XM_006291803.1|' in blast_qresult
True
>>> 'gi|565467969|ref|XM_006291803.1|' in blast_qresult
True
>>> for hit in blast_qresult[:5]:
... print("%s %i" % (hit.id, hit.seq_len))
...
gi|568824607|ref|XM_006466626.1| 878
gi|568824605|ref|XM_006466625.1| 911
gi|568824603|ref|XM_006466624.1| 894
gi|568824601|ref|XM_006466623.1| 922
gi|568824599|ref|XM_006466622.1| 966
>>> sort_key = lambda hit: hit.seq_len
>>> sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False)
>>> for hit in sorted_qresult[:5]:
... print("%s %i" % (hit.id, hit.seq_len))...
gi|567866316|ref|XM_006425718.1| 1202
gi|566180892|ref|XM_006380679.1| 1182
gi|46577795|gb|AY587773.1| 1176
gi|567188085|ref|XM_006403931.1| 1171
gi|565355903|ref|XM_006344753.1| 1153
>>>
The advantage of having the in_place flag here is that we’re preserving the native ordering, so we may use it again later. You should note that this is not the default behavior of QueryResult.sort, however, which is why we needed to set the in_place flag to True explicitly.
>>> filter_func = lambda hit: len(hit.hsps) > 1
>>> filtered_qresult = blast_qresult.hit_filter(filter_func)
>>> len(filtered_qresult)
0
>>> blast_qresult
QueryResult(id='40259', 50 hits)
>>> def map_func(hit):
... hit.id = hit.id.split('|')[3] # renames 'gi|301171322|ref|NR_035857.1|' to 'NR_035857.1'
... return hit
...
>>> mapped_qresult = blast_qresult.hit_map(map_func)
>>> for hit in mapped_qresult[:5]:
... print(hit.id)
...
XM_006466626.1
XM_006466625.1
XM_006466624.1
XM_006466623.1
XM_006466622.1
>>>
Bio.SearchIO Hit:
=================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> blast_hit = blast_qresult[3]
>>> print(blast_hit)
Query: 40259
gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum crystal...
Hit: gi|568824601|ref|XM_006466623.1| (922)
PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein...
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 3.4e-99 372.78 596 [63:655] [136:725]
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_hit = blat_qresult[0]
>>> print(blat_hit)
Query: mystery_seq
Hit: chr19 (59128983)
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 ? ? ? [0:61] [54204480:54204541]
1 ? ? ? [0:61] [54233104:54264463]
2 ? ? ? [0:61] [54254477:54260071]
3 ? ? ? [1:61] [54210720:54210780]
4 ? ? ? [0:60] [54198476:54198536]
5 ? ? ? [0:61] [54265610:54265671]
6 ? ? ? [0:61] [54238143:54240175]
7 ? ? ? [0:60] [54189735:54189795]
8 ? ? ? [0:61] [54185425:54185486]
9 ? ? ? [0:60] [54197657:54197717]
10 ? ? ? [0:61] [54255662:54255723]
11 ? ? ? [0:61] [54201651:54201712]
12 ? ? ? [8:60] [54206009:54206061]
13 ? ? ? [10:61] [54178987:54179038]
14 ? ? ? [8:61] [54212018:54212071]
15 ? ? ? [8:51] [54234278:54234321]
16 ? ? ? [8:61] [54238143:54238196]
>>> for hsp in blast_hit:
... hsp
...
HSP(hit_id='gi|568824601|ref|XM_006466623.1|', query_id='40259', 1 fragments)
>>> blat_hit[0]
HSP(hit_id='chr19', query_id='mystery_seq', 1 fragments)
>>> sliced_hit=blat_hit[4:9]
>>> print(sliced_hit)
Query: mystery_seq
Hit: chr19 (59128983)
HSPs: ---- -------- --------- ------ --------------- ---------------------
# E-value Bit score Span Query range Hit range
---- -------- --------- ------ --------------- ---------------------
0 ? ? ? [0:60] [54198476:54198536]
1 ? ? ? [0:61] [54265610:54265671]
2 ? ? ? [0:61] [54238143:54240175]
3 ? ? ? [0:60] [54189735:54189795]
4 ? ? ? [0:61] [54185425:54185486]
>>>
Bio.SearchIO HSP:
=================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> blast_hsp = blast_qresult[0][0] # first hit, first hsp
>>> print(blast_hsp)
Query: 40259 gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthe...
Hit: gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold...
Query range: [63:655] (1)
Hit range: [92:681] (1)
Quick stats: evalue 3.4e-99; bitscore 372.78
Fragments: 1 (596 columns)
Query - AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAAT~~~GTCTG
||||||||||| |||| || ||||| |||||||||||| | || | || | ~~~|||||
Hit - AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAG~~~GTCTG
>>> blast_hsp.query_range
(63, 655)
>>> blast_hsp.evalue
3.4368500000000001e-99
>>> blast_hsp.hit_start
92
>>> blast_hsp.query_span
592
>>> blast_hsp.aln_span
596
>>> blast_hsp.gap_num
11
>>> blast_hsp.ident_num
443
>>> blast_hsp.__dict__.keys()
['bitscore', 'evalue', 'ident_num', 'gap_num', 'bitscore_raw', 'pos_num', '_items']
>>> blast_hsp.query
SeqRecord(seq=Seq('AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGG...CTG', DNAAlphabet()), id='40259', name='aligned query sequence', description="gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum crystallinum cDNA 5' similar to cold acclimation protein, mRNA sequence", dbxrefs=[])
>>> blast_hsp.hit
SeqRecord(seq=Seq('AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAG...CTG', DNAAlphabet()), id='gi|568824607|ref|XM_006466626.1|', name='aligned hit sequence', description='PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA', dbxrefs=[])
>>> print(blast_hsp.aln)
DNAAlphabet() alignment with 2 rows and 596 columns
AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAA...CTG 40259
AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGAT...CTG gi|568824607|ref|XM_006466626.1|
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_hsp = blat_qresult[0][0] # first hit, first hsp
>>> print(blat_hsp)
Query: mystery_seq
Hit: chr19
Query range: [0:61] (1)
Hit range: [54204480:54204541] (1)
Quick stats: evalue ?; bitscore ?
Fragments: 1 (? columns)
>>> blat_hsp.hit
>>> blat_hsp.hit is None
True
>>> blat_hsp.query is None
True
>>> blat_hsp.aln is None
True
>>> blat_hsp.query_span
61
>>> blat_hsp.hit_span
61
>>> blat_hsp.score
61
>>> blat_hsp.mismatch_num
0
>>> blat_hsp2=blat_qresult[0][1]
>>> print(blat_hsp2)
Query: mystery_seq
Hit: chr19
Query range: [0:61] (1)
Hit range: [54233104:54264463] (1)
Quick stats: evalue ?; bitscore ?
Fragments: --- -------------- ---------------------- ----------------------
# Span Query range Hit range
--- -------------- ---------------------- ----------------------
0 ? [0:18] [54233104:54233122]
1 ? [18:61] [54264420:54264463]
>>> blat_hsp2.hit_range
(54233104, 54264463)
>>> blat_hsp2.hit_range_all
[(54233104, 54233122), (54264420, 54264463)]
>>> blat_hsp2.hit_span
31359
>>> blat_hsp2.hit_span_all
[18, 43]
>>> blat_hsp2.hit_inter_ranges
[(54233122, 54264420)]
>>> blat_hsp2.hit_inter_spans
[31298]
>>> blat_hsp2.is_fragmented
True
>>> blat_hsp.is_fragmented
False
>>>
Bio.SearchIO HSPFragment:
=========================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> blast_qresult=SearchIO.read('my_blast.xml','blast-xml')
>>> blast_frag = blast_qresult[0][0][0] # first hit, first hsp, first fragment
>>> print(blast_frag)
Query: 40259 gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthe...
Hit: gi|568824607|ref|XM_006466626.1| PREDICTED: Citrus sinensis cold...
Query range: [63:655] (1)
Hit range: [92:681] (1)
Fragments: 1 (596 columns)
Query - AAATGGGGAGAGAAATGAAGTACTTGGCCATGAAAACTGATCAATTGGCCGTGGCTAAT~~~GTCTG
||||||||||| |||| || ||||| |||||||||||| | || | || | ~~~|||||
Hit - AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAGCAGAG~~~GTCTG
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_frag = blat_qresult[0][0][0] # first hit, first hsp, first fragment
>>> print(blat_frag)
Query: mystery_seq
Hit: chr19
Query range: [0:61] (1)
Hit range: [54204480:54204541] (1)
Fragments: 1 (? columns)
>>> blast_frag.query_start
63
>>> blast_frag.hit_strand
1
>>> blast_frag.hit
SeqRecord(seq=Seq('AAATGGGGAGAT---TGAATTATTTGGCTATGAAAACTGATGATCAGGTTGCAG...CTG', DNAAlphabet()), id='gi|568824607|ref|XM_006466626.1|', name='aligned hit sequence', description='PREDICTED: Citrus sinensis cold-regulated 413 plasma membrane protein 2-like (LOC102620025), transcript variant X5, mRNA', dbxrefs=[])
>>>
Bio.SearchIO tab-delimited files:
=================================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> qresult = SearchIO.read('tab_2226_tblastn_003.txt', 'blast-tab')
>>> qresult
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> qresult2 = SearchIO.read('tab_2226_tblastn_007.txt', 'blast-tab', comments=True)
>>> qresult2
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> for qresult in qresult:
... print(qresult.id)
...
gi|145479850|ref|XM_001425911.1|
gi|72012412|ref|XM_777959.1|
gi|115975252|ref|XM_001180111.1|
>>>
One can similarly use SearchIO.index and SearchIO.index_db (storing in sqlite3 database)
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SearchIO
/usr/lib64/python2.6/site-packages/biopython-1.62-py2.6-linux-x86_64.egg/Bio/SearchIO/__init__.py:213: BiopythonExperimentalWarning: Bio.SearchIO is an experimental submodule which may undergo significant changes prior to its future official release.
BiopythonExperimentalWarning)
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab')
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|']
>>> idx['gi|16080617|ref|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> idx = SearchIO.index('tab_2226_tblastn_005.txt', 'blast-tab', comments=True)
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|', 'random_s00']
>>> idx['gi|16080617|ref|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> key_function = lambda id: id.upper() # capitalizes the keys
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab', key_function=key_function)
>>> sorted(idx.keys())
['GI|11464971:4-101', 'GI|16080617|REF|NP_391444.1|']
>>> idx['GI|16080617|REF|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> qresults=SearchIO.parse('mirna.xml','blast-xml')
>>> SearchIO.write(qresults,'results.tab','blast-tab')
(3, 239, 277, 277)
>>> SearchIO.convert('mirna.xml', 'blast-xml', 'results.tab', 'blast-tab')
(3, 239, 277, 277)
GenomeDiagram package:
======================
[uzi@quince-srv2 ~/biopython]$ python
Python 2.6.6 (r266:84292, Nov 22 2013, 12:16:22)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from reportlab.lib import colors
>>> from reportlab.lib.units import cm
>>> from Bio.Graphics import GenomeDiagram
>>> from Bio import SeqIO
>>> record=SeqIO.read("NC_005816.gb","genbank")
>>> gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
>>> gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")>>> gd_feature_set = gd_track_for_features.new_set()
>>> for feature in record.features:
... if feature.type != "gene":
... #Exclude this feature
... continue
... if len(gd_feature_set) % 2 == 0:
... color = colors.blue
... else:
... color = colors.lightblue
... gd_feature_set.add_feature(feature,color=color,label=True)
...
>>> gd_diagram.draw(format="linear",orientation="landscape",pagesize='A4',fragments=4,start=0,end=len(record))
>>> gd_diagram.write("plasmid_linear.pdf","PDF")
>>> gd_diagram.write("plasmid_linear.eps","EPS")
>>> gd_diagram.write("plasmid_linear.svg","SVG")
>>> gd_diagram.write("plasmid_linear.png","PNG")
>>> gd_diagram.draw(format="circular",circular=True,pagesize=(20*cm,20*cm), start=0,end=len(record),circle_core=0.7)
>>> gd_diagram.write("plasmid_circular.pdf","PDF")