Fancy One-liners
by Umer Zeeshan Ijaz and Chris Quince
This webpage lists some of the one-liners that we frequently use in metagenomic analyses. You can click on the links below to browse through the different topics. You can copy and paste the commands into your terminal as they are, provided you follow the same naming conventions and folder structures as we do. We are sharing these snippets in the hope that, if they prove useful in your analyses, we will be credited appropriately, as considerable effort has gone into devising them.
Perl one-liners
Extracting information from GBK files
Identifying duplicates in two FASTA files (awk)
Converting "Sample[TAB]Feature[TAB]Abundance" list to tab-delimited abundance table
Dereplicating Reads
Paired-end assembler
Generating abundance tables and trees from CREST and RDP classifiers
Spatial comparison of read qualities between different sequencing runs (Illumina paired-end reads)
Subsampling FASTA and FASTQ files
Identifying low-coverage genomes in metagenomic shotgun sequencing samples (mock communities)
Variant Calling
Getting linkage information (reads that map to multiple contigs/genomes) from SAM files
Does subsampling improve assembly quality for very high coverage datasets when using Velvet without removing noise from reads?
Extracting 16S rRNA sequences from NCBI's locally installed nt database using blastdbcmd
Resolving NCBI Taxonomy using BioSQL
Printing all FASTA headers
Thanks to the -n option, Perl reads each line into the $_ variable and then calls the print statement if the line starts with >. Because print defaults to $_, you can omit $_ from the print statement:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if /^>/' ABKL02.fsa
>gi|219671117|gb|ABKL02000001.1| Clostridium difficile QCD-23m63 contig00001_2, whole genome shotgun sequence
>gi|219671116|gb|ABKL02000002.1| Clostridium difficile QCD-23m63 contig00002_2, whole genome shotgun sequence
>gi|219671115|gb|ABKL02000003.1| Clostridium difficile QCD-23m63 contig00003_2, whole genome shotgun sequence
>gi|219671114|gb|ABKL02000004.1| Clostridium difficile QCD-23m63 contig00004_2, whole genome shotgun sequence
>gi|219671113|gb|ABKL02000005.1| Clostridium difficile QCD-23m63 contig00005_2, whole genome shotgun sequence
>gi|219671112|gb|ABKL02000006.1| Clostridium difficile QCD-23m63 contig00006_2, whole genome shotgun sequence
>gi|219671111|gb|ABKL02000007.1| Clostridium difficile QCD-23m63 contig00007_2, whole genome shotgun sequence
>gi|219671110|gb|ABKL02000008.1| Clostridium difficile QCD-23m63 contig00008_2, whole genome shotgun sequence
>gi|219671109|gb|ABKL02000009.1| Clostridium difficile QCD-23m63 contig00009_2, whole genome shotgun sequence
...
...
Printing FASTA headers with line numbers
This one-liner uses the $. special variable, which stands for "current input line number". Each time Perl reads the next line, it increments $. by one:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print "$. $_" if /^>/ ' ABKL02.fsa
1 >gi|219671117|gb|ABKL02000001.1| Clostridium difficile QCD-23m63 contig00001_2, whole genome shotgun sequence
133 >gi|219671116|gb|ABKL02000002.1| Clostridium difficile QCD-23m63 contig00002_2, whole genome shotgun sequence
1474 >gi|219671115|gb|ABKL02000003.1| Clostridium difficile QCD-23m63 contig00003_2, whole genome shotgun sequence
1486 >gi|219671114|gb|ABKL02000004.1| Clostridium difficile QCD-23m63 contig00004_2, whole genome shotgun sequence
1564 >gi|219671113|gb|ABKL02000005.1| Clostridium difficile QCD-23m63 contig00005_2, whole genome shotgun sequence
2566 >gi|219671112|gb|ABKL02000006.1| Clostridium difficile QCD-23m63 contig00006_2, whole genome shotgun sequence
2904 >gi|219671111|gb|ABKL02000007.1| Clostridium difficile QCD-23m63 contig00007_2, whole genome shotgun sequence
4311 >gi|219671110|gb|ABKL02000008.1| Clostridium difficile QCD-23m63 contig00008_2, whole genome shotgun sequence
...
...
Printing top 5 lines of the abundance table (emulating head -5)
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if $. <= 5' phylum.csv
Samples,S113_TAGAGCTGCCAT,S104_CCGAACGTCACT,S61_CTCATCATGTTC,S339_ACACCTGCGATC,S77_CACGATGGTCAT,S266_ACAGGAGGGTGT,S337_AACGAGGCAACG,S262_TTCTAGAGTGCG,S145_CAACGCTAGAAT,S11_TGGAAGAACGGC,S208_AGATGTCCGTCA,S108_GAACCTATGACA,S179_CTGCATACTGAG,S157_CATAAGGGAGGC,S163_GTGTGCTAACGT,S333_GCCTTACGATAG,S158_TGTGTTACTCCT,S329_CAATGTAGACAC,S23_CCTCTGAGAGCT,S97_GACGCTTTGCTG,S209_GCACCTGTTGAA,S68_CGAAGCATCTAC,S275_GCAAATCAGCCT,S13_AACCCAGATGAT,S147_TCTGTAGAGCCA,S178_TTGGTCTCCTCT,S183_AGCAGCTATTGC,S26_CGTGCACAATTG,S274_AGCACCGGTCTT,S597_TGTGGTGATGTA,S169_AAGAGTCTCTAG,S92_TTCCAGGCAGAT,S15_AACAAACTGCCA,S14_GATATACCAGTG,S10_CTTCCCTAACTC,S277_TCTTCAACTACC,S112_TGGCCGTTACTG,S199_CTCGGATAGATC,S307_TGACAACCGAAT,S221_TAGAGGCGTAGG,S144_CGAATGAGTCAT,S276_AGCGGCCTATTA,S595_CTATCCAAGTGG,S127_TCTAGCCTGGCA,S43_AGCGCTCACATC,S259_CGGTCTGTCTGA,S156_CGATATCAGTAG,S237_GATGATAACCCA,S213_TGAGTCATTGAG,S111_GTGAGTCATACC,S65_TCGGCGATCATC,S110_GAACAGCTCTAC,S263_ACGGATGTTATG,S181_GTCAATTAGTGG,S7_CAATCGGCTTGC,S153_TCGGATCTGTGA,S207_ATACGCATCAAG,S50_TAGCGCGAACTT,S594_TTCCTGTTAACC,S204_TACTGAGCCTCG,S4_CTAGGATCACTG,S598_CTTTCGTTCAAC,S171_AGATCTATGCAG,S66_GTTTCACGCGAA,S176_TACCTAGTGAGA,S206_AGCTTCGACAGT,S210_CCTAGAGAAACT,S60_TCGAGCCGATCT,S100_CAAACCTATGGC,S95_CTTGGAGGCTTA,S338_GAAGACAGCGAC,S609_CATCTGGGCAAT,S164_CTTGCGGCAATC,S74_AGGGTACAGGGT,S239_TTGTATGACAGG,S610_TGTCCGTGGATC,S118_GGCTAAACTATG,S12_GCTAGACACTAC,S315_TTGCCTGGGTCA,S53_ATATCGCGATGA,S51_CATACACGCACC,S225_TGCCGAGTAATC,S48_ACGCACATACAA,S331_TGGCGATACGTT,S310_TAGGCTCGTGCT,S52_ACCTCAGTCAAG,S600_GTTGGCGTTACA,S220_AAGCAGATTGTC,S608_TCTAACGAGTGC,S114_ATCTAGTGGCAA,S592_TCTGGAACGGTT,S267_GCTGTCGTCAAC,S593_GTACTACCTCGG,S316_CAATTCTGCTTC,S302_TGCAAGCTAAGT,S151_TGCGGTTGACTC,S90_GTCACCAATCCG,S96_ACGTGGTTCCAC,S59_CAAAGTTTGCGA,S261_GTACATGTCGCC,S305_GTGTGTGCCATA,S173_CGGCAAACACTT,S278_TGGAATTCGGCT,S1_ATTGCAAGCAAC,S217_TGCTCCGTAGAA,S165_TGAGGTTTGATG,S47_ATGGCCTGACTA,S49_TGAGTGGTCTGT,S152_TGAGAAGAAAGG,S335_TACCTGTGTCTT,S18_CAAGCCCTAGTA,S98_ACAGGGTTTGTA,S56_AGCAGGCAC
GAA,S107_CGACACGGAGAA,S73_TGTACCAACCGA,S99_GCCTATGAGATC,S230_GTAACCACCACC,S67_ACAAGAACCTTG,S202_TCCTTAGAAGGC,S9_TGACCGGCTGTT,S162_TAACCCGATAGA,S219_CGAGTTCATCGA,S205_CTCGATGTAAGC,S6_TGAGGACTACCT,S19_TAGTGTCGGATC,S180_CGATGAATATCG,S200_TTCCCGAAACGA,S601_GAAGTAGCGAGC,S236_ACTACCTCTTCA,S119_AAGAGCAGAGCC,S161_AGTGGCACTATC,S121_TCAACCCGTGAA,S211_GAGGTTCTTGAC,S279_TAAGATGCAGTC,S115_CCTTCAATGGGA,S603_GCGGAAACATGG,S106_CCATCACATAGG,S212_CTGTAAAGGTTG,S22_CGTAGAGCTCTC,S25_GCGGACTATTCA,S55_CCGATGCCTTGA,S16_GTAGACATGTGT,S155_CACAGGATTACC,S44_TGGTTATGGCAC,S270_ATAGCTTCGTGG,S314_CTCCTTAAGGCG,S599_CCGAAGATTCTG,S167_AAGAAGCCGGAC,S148_CCGACTCTAGGT,S172_GCACAAGGCAAG,S150_GACAACGAATCT,S606_ACGTAACCACGT,S271_CGGGATCAAATT,S63_GGCCTATAAGTC,S233_CATAGCTCGGTC,S273_ATCTTGGAGTCG,S20_GTCATAAGAACC,S605_TGCATGACAGTC,S235_TATGGAGCTAGT,S17_TACAGTTACGCG,S635_TTGCGACAAAGT,S591_TACCACAACGAA,S71_CTATGCCGGCTA,S228_ACCTTGACAAGA,S46_TGCTACAGACGT,S93_TATGGTACCCAG,S24_CCTCGATGCAGT,S122_GTTTGAAACACG,S91_CACTAACAAACG,S27_CGGCCTAAGTTC,S149_ATCCTACGAGCA,S340_GGCGTTGCATTC,S128_AATGCAATGCGT,S109_ATGCCGGTAATA,S177_CGTTCTGGTGGT,S94_CACGACTTGACA,S175_TTCCGAATCGGC,S69_GTTTGGCCACAC,S203_GATGGACTTCAA,S117_ACATACTGAGCA,S216_CATGCGGATCCT,S154_GCCGGTACTCTA,S57_TACGCAGCACTA,S8_AACACTCGATCG,S272_AGTCATCGAATG,S268_AAGCTTGAAACC,S168_ACGGGATACAGG,S102_ACCATCCAACGA,S62_CCAGGGACTTCT,S2_CACGTGACATGT,S612_GTTGGTTGGCAT,S72_GTGGTATGGGAG,S105_ACACCAACACCA,S101_ATCGCTTAAGGC,S223_TAGACCGACTCC,S214_TACGGCAGTTCA,S126_CACTTTGGGTGC,S264_TTGAGGCTACAA,S590_TGCGAGTATATG,S222_TCAGCGCCGTTA,S269_TAAGCGTCTCGA,S613_TTCCACACGTGG,S5_GATGACCCAAAT,S45_TTGCACCGTCGA,S170_TCCGTCATGGGT,S257_CTACCACGGTAC,S201_GAACTTTAGCGC,S3_CACAGTTGAAGT,S64_TCCATTTCATGC,S607_GTCGGAAATTGT,S146_ATCAGAGCCCAT,S255_GGTAAGTTTGAC,S76_TTGGCGGGTTAT,S224_GTCAACGCTGTC,S103_GCAATAGGAGGA,S611_ACTCGGCCAACT,S182_AGTACGCAGTCT,S54_CGCCGGTAATCT,S75_AGAGTGCTAATC,S125_GCTCAGGACTCT,S166_ATTGCTGGTCGA,S159_GGTACCTGCAAT,S70_TGACGTAGAACT,S58_CGCTTAGTGCTG,S602_TTGCGGACCCTA,S317_ACTGGCAAACCT,S596
_GTGTCCGGATTC,S124_TCGCCAGTGCAT,S21_GTCCGCAAGTTA,S238_GGCCCAATATAA,S123_AGAGAGACAGGT,S174_GCGAGTTCCTGT,S218_TGATAGGTACAC,S604_AACGTTAGTGTG,S120_GGAGAGATCACG,S318_AATCAGAGCTTG,S215_TGCACAGTCGCT,S160_TCGCCTATAAGG,S265_GTAGGAACCGGA
"Acidobacteria",2,368,6,12,62,476,227,85,58,3,918,131,34,23,17,250,67,15,2,46,69,51,0,23,18,163,120,164,224,2,104,81,143,257,7,61,279,26,139,13,319,47,3,104,13,39,113,39,72,63,132,48,4,176,8,7,31,60,7,396,38,4,86,61,230,24,176,1,13,99,62,15,25,5,121,19,62,4,51,112,275,56,19,16,1,141,1,8,5,268,32,291,27,65,23,13,282,135,1,99,157,545,93,70,144,44,19,31,8,50,10,59,39,126,6,27,10,74,42,108,4,21,14,28,46,52,9,3,45,55,2,164,122,142,332,171,289,134,4,127,76,79,55,13,656,59,4,31,28,194,10,41,413,293,101,207,9,82,10,29,3,75,2,37,47,122,179,119,229,6,35,12,67,41,115,129,231,21,171,339,7,85,2,1,157,218,75,713,12,125,2,3,322,8,59,75,0,0,282,27,133,3,25,42,32,72,9,29,241,11,28,68,210,75,536,1,62,70,108,4,73,88,8,18,272,91,0,102,2,84,78,70,13,4,35,135,57,186,586
"Actinobacteria",6,87,15,30,162,200,141,165,101,0,367,47,59,9,3,127,14,13,10,82,39,50,1,47,58,86,211,236,50,186,33,264,136,235,6,27,126,26,83,23,527,23,565,28,38,72,39,73,25,31,118,48,10,183,16,6,28,90,763,304,59,9,110,99,113,8,78,5,7,111,49,12,16,6,300,16,74,8,32,103,284,71,37,5,0,187,4,2,22,382,105,80,76,58,44,11,1633,133,10,137,193,513,22,118,40,25,56,25,3,24,11,82,35,45,16,36,14,86,33,167,1,5,17,18,46,43,11,2,112,78,5,83,53,57,311,1620,128,35,8,195,52,95,78,17,201,40,5,9,66,210,19,60,124,240,102,32,10,538,44,49,2,216,6,30,102,259,286,76,562,17,51,5,56,30,64,167,176,29,125,284,8,82,8,4,41,62,20,344,17,138,5,9,131,15,14,30,0,1,656,14,34,6,30,114,60,69,28,32,194,12,73,72,193,23,236,7,92,100,141,4,31,47,11,64,2901,65,238,80,3,116,71,55,5,104,43,100,33,68,1140
"Bacteroidetes",14,2526,40,285,1251,4724,7950,1576,1582,7,13771,679,491,109,17,7705,250,119,14,749,660,960,0,205,331,1772,3111,5167,931,202,264,4070,2885,6917,11,279,2077,183,5253,26,14552,135,769,1081,116,641,703,287,229,120,2797,92,36,4983,29,26,145,735,1267,8271,310,33,1419,1692,3170,91,1796,14,110,1479,1080,509,85,30,4419,694,951,13,1590,1208,4233,2214,277,222,5,2248,14,6,802,6901,900,2136,661,3280,1661,51,24672,1979,24,1750,6593,11393,357,1522,767,158,491,265,13,460,117,614,457,545,28,204,351,1056,669,2026,9,10,110,47,545,541,86,16,458,977,20,2478,451,543,5941,15296,1588,492,15,3820,704,1089,1321,93,4938,2036,47,46,406,3677,100,3945,3717,9054,1755,835,48,2417,135,551,34,3149,22,976,1194,4405,6098,1496,9924,33,546,120,724,177,1316,2165,4187,92,2133,5088,31,2312,38,13,1468,1228,228,8310,44,1400,189,35,1868,108,106,266,0,26,10963,34,534,135,248,1763,593,791,265,218,5421,1254,744,777,4338,124,5035,261,1175,862,2770,22,159,344,60,422,21848,4352,259,876,24,1195,805,1503,11,91,776,7774,169,1034,17498
"Chloroflexi",297,4778,333,518,1877,4905,7863,2618,1398,82,12343,1654,1005,500,264,2659,1243,261,101,1471,1010,1008,1,830,447,2919,2220,2752,1642,29,1089,5158,2502,5998,153,1096,6151,497,3161,335,5255,619,163,2009,241,700,1081,554,837,794,2219,606,110,3580,277,68,749,806,189,8837,1334,146,2443,1108,3538,316,3163,200,414,2655,1250,296,263,221,2572,435,1227,114,2271,3484,5813,299,247,383,11,4496,67,112,377,2731,1640,2711,989,2086,635,189,15615,3246,265,1085,5727,17216,1196,2184,1992,463,879,1028,42,812,561,1491,1210,2011,196,565,233,1728,888,2284,96,202,553,303,659,1226,230,59,698,1069,130,3012,1708,619,6314,8725,5504,1149,97,2026,1416,1855,1169,376,5862,1924,136,459,1015,6826,372,805,4001,7577,537,2921,136,2874,526,919,64,2450,168,213,982,3827,2568,2750,9880,133,864,253,1845,698,4384,3452,4300,253,3599,6613,208,971,415,135,3373,3696,1420,15173,371,2767,65,128,4741,367,529,723,0,35,7571,398,1605,126,437,1233,1441,1493,552,725,4915,237,840,2621,3659,910,10517,64,1370,2067,2687,88,864,859,382,1288,15388,4590,56,1757,55,1028,2902,1931,148,125,1606,4289,671,1821,8244
Printing last line of abundance table (emulating tail -1)
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if eof' phylum.csv
Cyanobacteria/Chloroplast,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Printing last 5 lines of abundance table (emulating tail -5)
Here, we push each line onto the @a array and then replace the array with a slice of itself. We do @a = @a[@a-5..$#a], which means: replace @a with the last 5 elements of @a. @a-5 is evaluated in scalar context here, giving the number of elements in the array minus 5, and $#a is the last index of the @a array.
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-5..$#a]; END { print @a }' phylum.csv
"Lentisphaerae",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Elusimicrobia",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,0,0,0,0,0,4,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Fusobacteria",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Deinococcus-Thermus",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Cyanobacteria/Chloroplast,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Converting comma-delimited format to tab-delimited format
-a enables the automatic splitting of the current line into fields in the @F array. The splitting happens on whitespace by default, so here we override the separator with -F",". The -l argument ensures that print outputs a newline at the end of each line:
[uzi@quince-srv2 ~/oneliners$ perl -F"," -alne 'print join("\t",@F)' phylum.csv | head -3
Samples S113_TAGAGCTGCCAT S104_CCGAACGTCACT S61_CTCATCATGTTC S339_ACACCTGCGATC S77_CACGATGGTCAT S266_ACAGGAGGGTGT S337_AACGAGGCAACG S262_TTCTAGAGTGCG S145_CAACGCTAGAAT S11_TGGAAGAACGGC S208_AGATGTCCGTCA S108_GAACCTATGACA S179_CTGCATACTGAG S157_CATAAGGGAGGC S163_GTGTGCTAACGT S333_GCCTTACGATAG S158_TGTGTTACTCCT S329_CAATGTAGACAC S23_CCTCTGAGAGCT S97_GACGCTTTGCTG S209_GCACCTGTTGAA S68_CGAAGCATCTAC S275_GCAAATCAGCCT S13_AACCCAGATGAT S147_TCTGTAGAGCCA S178_TTGGTCTCCTCT S183_AGCAGCTATTGC S26_CGTGCACAATTG S274_AGCACCGGTCTT S597_TGTGGTGATGTA S169_AAGAGTCTCTAG S92_TTCCAGGCAGAT S15_AACAAACTGCCA S14_GATATACCAGTG S10_CTTCCCTAACTC S277_TCTTCAACTACC S112_TGGCCGTTACTG S199_CTCGGATAGATC S307_TGACAACCGAAT S221_TAGAGGCGTAGG S144_CGAATGAGTCAT S276_AGCGGCCTATTA S595_CTATCCAAGTGG S127_TCTAGCCTGGCA S43_AGCGCTCACATC S259_CGGTCTGTCTGA S156_CGATATCAGTAG S237_GATGATAACCCA S213_TGAGTCATTGAG S111_GTGAGTCATACC S65_TCGGCGATCATC S110_GAACAGCTCTAC S263_ACGGATGTTATG S181_GTCAATTAGTGG S7_CAATCGGCTTGC S153_TCGGATCTGTGA S207_ATACGCATCAAG S50_TAGCGCGAACTT S594_TTCCTGTTAACC S204_TACTGAGCCTCG S4_CTAGGATCACTG S598_CTTTCGTTCAAC S171_AGATCTATGCAG S66_GTTTCACGCGAA S176_TACCTAGTGAGA S206_AGCTTCGACAGT S210_CCTAGAGAAACT S60_TCGAGCCGATCT S100_CAAACCTATGGC S95_CTTGGAGGCTTA S338_GAAGACAGCGAC S609_CATCTGGGCAAT S164_CTTGCGGCAATC S74_AGGGTACAGGGT S239_TTGTATGACAGG S610_TGTCCGTGGATC S118_GGCTAAACTATG S12_GCTAGACACTAC S315_TTGCCTGGGTCA S53_ATATCGCGATGA S51_CATACACGCACC S225_TGCCGAGTAATC S48_ACGCACATACAA S331_TGGCGATACGTT S310_TAGGCTCGTGCT S52_ACCTCAGTCAAG S600_GTTGGCGTTACA S220_AAGCAGATTGTC S608_TCTAACGAGTGC S114_ATCTAGTGGCAA S592_TCTGGAACGGTT S267_GCTGTCGTCAAC S593_GTACTACCTCGG S316_CAATTCTGCTTC S302_TGCAAGCTAAGT S151_TGCGGTTGACTC S90_GTCACCAATCCG S96_ACGTGGTTCCAC S59_CAAAGTTTGCGA S261_GTACATGTCGCC S305_GTGTGTGCCATA S173_CGGCAAACACTT S278_TGGAATTCGGCT S1_ATTGCAAGCAAC S217_TGCTCCGTAGAA S165_TGAGGTTTGATG S47_ATGGCCTGACTA S49_TGAGTGGTCTGT S152_TGAGAAGAAAGG S335_TACCTGTGTCTT S18_CAAGCCCTAGTA S98_ACAGGGTTTGTA 
S56_AGCAGGCACGAA S107_CGACACGGAGAA S73_TGTACCAACCGA S99_GCCTATGAGATC S230_GTAACCACCACC S67_ACAAGAACCTTG S202_TCCTTAGAAGGC S9_TGACCGGCTGTT S162_TAACCCGATAGA S219_CGAGTTCATCGA S205_CTCGATGTAAGC S6_TGAGGACTACCT S19_TAGTGTCGGATC S180_CGATGAATATCG S200_TTCCCGAAACGA S601_GAAGTAGCGAGC S236_ACTACCTCTTCA S119_AAGAGCAGAGCC S161_AGTGGCACTATC S121_TCAACCCGTGAA S211_GAGGTTCTTGAC S279_TAAGATGCAGTC S115_CCTTCAATGGGA S603_GCGGAAACATGG S106_CCATCACATAGG S212_CTGTAAAGGTTG S22_CGTAGAGCTCTC S25_GCGGACTATTCA S55_CCGATGCCTTGA S16_GTAGACATGTGT S155_CACAGGATTACC S44_TGGTTATGGCAC S270_ATAGCTTCGTGG S314_CTCCTTAAGGCG S599_CCGAAGATTCTG S167_AAGAAGCCGGAC S148_CCGACTCTAGGT S172_GCACAAGGCAAG S150_GACAACGAATCT S606_ACGTAACCACGT S271_CGGGATCAAATT S63_GGCCTATAAGTC S233_CATAGCTCGGTC S273_ATCTTGGAGTCG S20_GTCATAAGAACC S605_TGCATGACAGTC S235_TATGGAGCTAGT S17_TACAGTTACGCG S635_TTGCGACAAAGT S591_TACCACAACGAA S71_CTATGCCGGCTA S228_ACCTTGACAAGA S46_TGCTACAGACGT S93_TATGGTACCCAG S24_CCTCGATGCAGT S122_GTTTGAAACACG S91_CACTAACAAACG S27_CGGCCTAAGTTC S149_ATCCTACGAGCA S340_GGCGTTGCATTC S128_AATGCAATGCGT S109_ATGCCGGTAATA S177_CGTTCTGGTGGT S94_CACGACTTGACA S175_TTCCGAATCGGC S69_GTTTGGCCACAC S203_GATGGACTTCAA S117_ACATACTGAGCA S216_CATGCGGATCCT S154_GCCGGTACTCTA S57_TACGCAGCACTA S8_AACACTCGATCG S272_AGTCATCGAATG S268_AAGCTTGAAACC S168_ACGGGATACAGG S102_ACCATCCAACGA S62_CCAGGGACTTCT S2_CACGTGACATGT S612_GTTGGTTGGCAT S72_GTGGTATGGGAG S105_ACACCAACACCA S101_ATCGCTTAAGGC S223_TAGACCGACTCC S214_TACGGCAGTTCA S126_CACTTTGGGTGC S264_TTGAGGCTACAA S590_TGCGAGTATATG S222_TCAGCGCCGTTA S269_TAAGCGTCTCGA S613_TTCCACACGTGG S5_GATGACCCAAAT S45_TTGCACCGTCGA S170_TCCGTCATGGGT S257_CTACCACGGTAC S201_GAACTTTAGCGC S3_CACAGTTGAAGT S64_TCCATTTCATGC S607_GTCGGAAATTGT S146_ATCAGAGCCCAT S255_GGTAAGTTTGAC S76_TTGGCGGGTTAT S224_GTCAACGCTGTC S103_GCAATAGGAGGA S611_ACTCGGCCAACT S182_AGTACGCAGTCT S54_CGCCGGTAATCT S75_AGAGTGCTAATC S125_GCTCAGGACTCT S166_ATTGCTGGTCGA S159_GGTACCTGCAAT S70_TGACGTAGAACT S58_CGCTTAGTGCTG S602_TTGCGGACCCTA 
S317_ACTGGCAAACCT S596_GTGTCCGGATTC S124_TCGCCAGTGCAT S21_GTCCGCAAGTTA S238_GGCCCAATATAA S123_AGAGAGACAGGT S174_GCGAGTTCCTGT S218_TGATAGGTACAC S604_AACGTTAGTGTG S120_GGAGAGATCACG S318_AATCAGAGCTTG S215_TGCACAGTCGCT S160_TCGCCTATAAGG S265_GTAGGAACCGGA
"Acidobacteria" 2 368 6 12 62 476 227 85 58 3 918 131 34 23 17 250 67 15 2 46 69 51 0 23 18 16120 164 224 2 104 81 143 257 7 61 279 26 139 13 319 47 3 104 13 39 113 39 72 63 132 48 4 1731 60 7 396 38 4 86 61 230 24 176 1 13 99 62 15 25 5 121 19 62 4 51 112 275 56 19 16141 1 8 5 268 32 291 27 65 23 13 282 135 1 99 157 545 93 70 144 44 19 31 8 50 10 59 39126 6 27 10 74 42 108 4 21 14 28 46 52 9 3 45 55 2 164 122 142 332 171 289 134 4 127 7679 55 13 656 59 4 31 28 194 10 41 413 293 101 207 9 82 10 29 3 75 2 37 47 122 179 119 2235 12 67 41 115 129 231 21 171 339 7 85 2 1 157 218 75 713 12 125 2 3 322 8 59 75 0 0 282 27 133 3 25 42 32 72 9 29 241 11 28 68 210 75 536 1 62 70 108 4 73 88 8 18 272 91102 2 84 78 70 13 4 35 135 57 186 586
"Actinobacteria" 6 87 15 30 162 200 141 165 101 0 367 47 59 9 3 127 14 13 10 82 39 50 1 47 5886 211 236 50 186 33 264 136 235 6 27 126 26 83 23 527 23 565 28 38 72 39 73 25 31 118 48 10183 16 6 28 90 763 304 59 9 110 99 113 8 78 5 7 111 49 12 16 6 300 16 74 8 32 103 2871 37 5 0 187 4 2 22 382 105 80 76 58 44 11 1633 133 10 137 193 513 22 118 40 25 56 25 3 24 11 82 35 45 16 36 14 86 33 167 1 5 17 18 46 43 11 2 112 78 5 83 53 57 311 1620 1235 8 195 52 95 78 17 201 40 5 9 66 210 19 60 124 240 102 32 10 538 44 49 2 216 6 30 10259 286 76 562 17 51 5 56 30 64 167 176 29 125 284 8 82 8 4 41 62 20 344 17 138 5 9 1315 14 30 0 1 656 14 34 6 30 114 60 69 28 32 194 12 73 72 193 23 236 7 92 100 141 4 3147 11 64 2901 65 238 80 3 116 71 55 5 104 43 100 33 68 1140
Permutations and Combinations
In the first one-liner, use p(@a,@a,@a,@a) to generate tetramers. You can also pass arrays of different lengths to p() instead of repeating @a. The last one-liner is the bash equivalent, shown for comparison.
[uzi@quince-srv2 ~/oneliners$ perl -le 'sub p{my $l=pop @_;unless(@_){return map [$_],@$l;}return map { my $ll=$_; map [@$ll,$_],@$l} p(@_);} @a=[A,C,G,T]; print join("", @$_) for p(@a,@a);'
AA
AC
AG
AT
CA
CC
CG
CT
GA
GC
GG
GT
TA
TC
TG
TT
[uzi@quince-srv2 ~/oneliners$ perl -le 'sub c{return [] unless @_;my $f=shift;my @r=c(@_); return @r,map{[$f,@$_]}@r;} print join("",@$_) for c(A,C,G,T)'
T
G
GT
C
CT
CG
CGT
A
AT
AG
AGT
AC
ACT
ACG
ACGT
[uzi@quince-srv2 ~/oneliners$ echo {A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}
AAAA AAAC AAAG AAAT AACA AACC AACG AACT AAGA AAGC AAGG AAGT AATA AATC AATG AATT ACAA ACAC ACAG ACAT ACCA ACCC ACCG ACCT ACGA ACGC ACGG ACGT ACTA ACTC ACTG ACTT AGAA AGAC AGAG AGAT AGCA AGCC AGCG AGCT AGGA AGGC AGGG AGGT AGTA AGTC AGTG AGTT ATAA ATAC ATAG ATAT ATCA ATCC ATCG ATCT ATGA ATGC ATGG ATGT ATTA ATTC ATTG ATTT CAAA CAAC CAAG CAAT CACA CACC CACG CACT CAGA CAGC CAGG CAGT CATA CATC CATG CATT CCAA CCAC CCAG CCAT CCCA CCCC CCCG CCCT CCGA CCGC CCGG CCGT CCTA CCTC CCTG CCTT CGAA CGAC CGAG CGAT CGCA CGCC CGCG CGCT CGGA CGGC CGGG CGGT CGTA CGTC CGTG CGTT CTAA CTAC CTAG CTAT CTCA CTCC CTCG CTCT CTGA CTGC CTGG CTGT CTTA CTTC CTTG CTTT GAAA GAAC GAAG GAAT GACA GACC GACG GACT GAGA GAGC GAGG GAGT GATA GATC GATG GATT GCAA GCAC GCAG GCAT GCCA GCCC GCCG GCCT GCGA GCGC GCGG GCGT GCTA GCTC GCTG GCTT GGAA GGAC GGAG GGAT GGCA GGCC GGCG GGCT GGGA GGGC GGGG GGGT GGTA GGTC GGTG GGTT GTAA GTAC GTAG GTAT GTCA GTCC GTCG GTCT GTGA GTGC GTGG GTGT GTTA GTTC GTTG GTTT TAAA TAAC TAAG TAAT TACA TACC TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT TCGA TCGC TCGG TCGT TCTA TCTC TCTG TCTT TGAA TGAC TGAG TGAT TGCA TGCC TGCG TGCT TGGA TGGC TGGG TGGT TGTA TGTC TGTG TGTT TTAA TTAC TTAG TTAT TTCA TTCC TTCG TTCT TTGA TTGC TTGG TTGT TTTA TTTC TTTG TTTT
Reverse complementing sequences
[uzi@quince-srv2 ~/oneliners$ echo -e "ACGTACGTACGT\nAGCTACT" | perl -nle 'print map{$_ =~ tr/ACGT/TGCA/; $_} reverse split("",$_)'
ACGTACGTACGT
AGTAGCT
Manipulating FASTQ files
We can print four lines of a FASTQ file at a time:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; print @a if ($. % 4 == 0)' forward.fastq | head
@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
...
...
In fact, we are printing all the lines:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; print @a if ($. % 4 == 0)' forward.fastq | wc -l
40000
[uzi@quince-srv2 ~/oneliners$ wc -l < forward.fastq
40000
So why is this functionality useful? We can convert a FASTQ file to a FASTA file quite easily by printing the first two elements of @a:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){print ">".substr($a[0],1).$a[1]}' forward.fastq | head
>MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
>MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
>MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
>MSQ-M01442:38:000000000-A5H4M:1:2104:20735:20554 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTCAGGGAAGAAATCCTTTGAGTTAATACCTCGGAGGGATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
...
...
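For comparison, the same FASTQ-to-FASTA conversion in awk, keying on the line number modulo 4 instead of buffering four lines (the two-record demo file is made up for illustration):

```shell
# Header lines (NR%4==1) have @ swapped for >, sequence lines (NR%4==2) pass
# through; the + separator and quality lines are dropped
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > demo.fastq
awk 'NR % 4 == 1 {sub(/^@/, ">"); print} NR % 4 == 2 {print}' demo.fastq
```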
We can also draw the PHRED quality scores for each read in a graphical manner:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){chomp($a[3]);print substr($a[0],1).join("\n", map{ "_"x(ord($_)-33).(ord($_)-33) }split("",$a[3]))."\n"}' forward.fastq | head -251
MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
___________________________________35
___________________________________35
___________________________________35
___________________________________35
__________________________________34
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
__________________________________34
__________________________________34
___________________________________35
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
_______________________________________39
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_______________15
__________________________26
_________________________25
__________________________26
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
______________14
_________________________25
__________________________26
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_________________________25
_____________13
__________________________26
__________________________26
_________________________________33
_________________________________33
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
...
...
Say we are given a set of GBK files in a folder. They can be genomes from NCBI or annotated contigs from software such as PROKKA. We can generate a simple tabulated structure (the first column being LOCUS and the remaining columns containing the features we are interested in) using the one-liners mentioned below.
Let us start with some random GBK files I downloaded from NCBI:
[uzi@quince-srv2 ~/oneliners/GBK_files$ ls
NC_013164.gbk NC_014259.gbk NC_017100.gbk NC_021150.gbk NC_021285.gbk
Note: In the one-liners given below, the trailing
| perl -alne '{print join("\t",@F[0..5])}'
is not required; it is only there to show the structure of the output (the first six columns) without cluttering this demonstration, so remove it for real use.
-->to store the full table in a file, replace the above command with
> table.tsv
-->to store a table with unique keywords, replace it with
| perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}' > table.tsv
Also note that if a genome/contig doesn't contain a given feature, awk 'NF>1{print $0}' will exclude it from the output, and that gensub() is specific to GNU awk (gawk). You can check
here on how I have used the extracted EC numbers with a
REST-style KEGG API
Extracting /db_xref="taxon:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/taxon/{print gensub(" +/db_xref=\"taxon:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 525919
NC_014259 436717
NC_017100 634453
NC_021150 1283331
NC_021285 1167634
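The extractions that follow differ only in the qualifier being matched, so for repeated use the pipeline can be wrapped in a small shell function. gbk_extract is a hypothetical name, not part of the original page, and this sketch uses index/substr instead of the GNU-awk-specific gensub so that it also runs under other awks (the demo GBK fragment is made up for illustration):

```shell
# gbk_extract 'QUALIFIER_PREFIX' FILES... -> one row per LOCUS, tab-separated values
gbk_extract () {
  pre=$1; shift
  awk -v pre="$pre" '
    (i = index($0, "/" pre)) {            # line carries the qualifier we want
      v = substr($0, i + 1 + length(pre)) # text after /QUALIFIER_PREFIX
      sub(/".*$/, "", v)                  # strip the closing quote and beyond
      print v
    }
    /^LOCUS/ { print ";" $2 }             # row separator: ;LOCUS_NAME
  ' "$@" | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1'
}

printf 'LOCUS       NC_TEST  1000 bp\n     /db_xref="taxon:12345"\n' > demo.gbk
gbk_extract 'db_xref="taxon:' demo.gbk
```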
Extracting /EC_number="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 2.1.1.37 2.7.7.49
NC_014259 6.1.1.1 4.2.1.9 4.1.1.31 1.3.1.26 1.1.1.274
NC_021285 3.1.4.- 3.6.3.25 2.4.1.- 2.7.7.7 2.3.1.1
Extracting /db_xref="GeneID:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/GeneID/{print gensub(" +/db_xref=\"GeneID:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 8368769 8368769 8368657 8368657 8368658
NC_014259 9384321 9384321 9380389 9380389 9380390
NC_017100 12066596 12066596 12066598 12066598 12066600
NC_021150 15372752 15372752 15367888 15367888 15370905
NC_021285 16467804 16467804 16467805 16467805 16467806
Extracting /db_xref="GI:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/db_xref=\"GI/{print gensub(" +/db_xref=\"GI:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 256821124 256821125 256821126 256821127 256821128
NC_014259 299768251 299768252 299768253 299768254 299768255
NC_017100 384049542 384049543 384049544 384049545 384049546
NC_021150 482897842 482899070 482898704 482898122 482899071
NC_021285 528981797 528981798 528981799 528981800 528981801
Extracting /db_xref="CDD:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/CDD/{print gensub(" +/db_xref=\"CDD:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 200366 100105 100105 234461 200370
NC_014259 221153 232941 99707 99707 99707
NC_017100 218785 238416 223655 238416 238416
NC_021150 221153 232941 99707 99707 99707
NC_021285 183285 239900 239900 239900 170049
Extracting /gene="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_014259 dnaA dnaA dnaA dnaA dnaA
NC_017100 rusA rusA rusA bcr/cflA bcr/cflA
NC_021150 dnaA dnaA dnaA dnaA dnaA
Extracting unique /gene="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}'
NC_014259 dnaA anmK dkgB nusB ribH psd pyrE murG murC ddl lpxC aceE glmM actP hisH hisB ompR gltX mraW mraY aroK aroB gltB gltD rpoZ gmk ispH rpsP rimM trmD rplS truB glyS glyQ aroE ribA engB rpsJ rplC rplD rplW rplB rpsS rplV rpsC rplP rpsQ rplN rplX rplE rpsN rpsH rplF rplR rpsE rpmD rplO secY rpmJ rpsM rpsD rplQ rplM ksgA apaH secF secD tgt queA aspS trpA mscL gatB gatA thrH valS rplU rpmA sdhA sdhB sucA sucC greA glpD groES groEL dapF tolB ruvB ruvA dapD tmk rnc era obgE mnmA rpmE hemE ampC trpC trpD ubiA cca fadE rpsB tsf queF glyA ppnK thiG gpsA rpsU purH prmA rplI rpsR rpsF prfA miaA pepN fabG murB moaC fumC ispF pyrH frr lpxD fabZ recA ompL argC pyrG eno ispD aroD tynA antC aspA recR hscA rpsT rpsA cmk gidB pgk ruvC purT tauD paaB paaA cysS xseA benD trmB pyrB rlmL alaS rnhB adk ureE ureC ureA lysS metH dut aat rpsL coaD smpB hemA ipk rpmF bioD dcd metG upp argJ hisD hisG rho ihfA pheT pheS rplT rpmI infC thrS panC panB mazG rumA cysM prfC ilvH leuS engA hisS ispG ndk proA clpX clpP tig metX thyA rpmB rpmG gltP truA infA leuD rpmE2 rpsO infB nusA secG tpiA coaE uvrC fadB fadA rpoB rplL rplJ rplA rplK hemC nrdR def atpC argS fur guaA prpB lldD pgi rph ileS lspA dnaK trmE rpmH
NC_017100 rusA bcr/cflA uvrA nifU recA modE/mopA mdoG mdoH recR hrcA exbB/tolQ exbD/tolR rodA ftsI mreD mreC mreB fusB marR degT suv3 nusG secE pfkB cysA cysW cysT araC acrB nodT luxR hlyD/fusE lysR ligB ttg2D djlA queA yjgF ttg2C uvrB impB clpS clpA chvG chvI grpE dnaK dnaJ duf6 fur rho corA phoB phoR oprO corC/hlyC eamA mazG htpG xre iojap aldC oprB ttg1D nifB/moaA dedA matE nhaAP fecCD sss mauE lysE pnuC badM/rrf2 crcB tetR hspA cobW uspA bipA glnBK lepA metW rhlE hppK rpoK era ompH fabZ yidC truA kpsF/gutQ spoU dnaB mscS radA gntR sua5 holC degP hrpB hsp33 recX tatAE ribF mutL ompA comA ahpD xerD secB tim44 mutS2 hicA hicB arsR hap nol1/nop2 dnaA cycH cycL dsbE cysK cycJ ccmC rmuC kdpE rpoB rpoC secY rpoA moxR emrB/qacA greA gatB rpoD yjgPQ minE minD minC carD ftsX ftsE hslV hslU sufE prmA uvrD/rep radC secF secD smc phoH miaB uvrC MoeD MoeE lexA moeA maf thiO thiS recG cobT cobS bolA hemK dnaC tatD osmC exsB/queC smpB purS terC moaC secG recF tolC nusB deoR recQ rpoE phnA apbA phzCF xdhC xdhB xdhA marR/hxlR czcA hlyD emrE asmA fecR/pupR copG rrf5 rrl5 rrs5 cbiD cbiG cobH cobG fis fdhD ftsJ sirA hlyA oppC oppB oppD oppF npd padR fcuA cydC cydD mgtC czcB/hlyD czcC rrf4 rrl4 rrs4 iclR lolD lolCE baf hu lon clpX clpP hesB sufS sufD sufC sufB hlyB/msbA mutS pmbA/tldD asnC mucR mviN plsX smpA/omlA envZ ompR rimM ffh/srp54 atp12 engA sod clpB hflX hfq ntrX ntrY ntrC ntrB nifR3 ispDF ubiHF groES groEL fkbH amtB xerC yajC apaG pfpI ribD obgE/cgtA mraZ mraW ftsW ftsQ ftsA ftsZ recN copA copB recO copZ lytB pirin glgX pcpB sco1/senC rrf3 rrl3 rrs3 dsbD tatC tatB scpB scpA pqqA pqqB pqqC pqqD pqqE aatC recJ rpoH fecE secB-2 pdxJ cobP cobD/cbiB cobQ cobOP cbiA cobN ycdH ureG ureF ureE ureD trbI trbG trbF trbL repA rrf2 rrl2 rrs2 xdhC/coxI truB cvpA ftsY phoU pstB pstA pstC apbC nusA ccmA ccmB thiC ftr1 mutY rrf1 rrl1 rrs1 ompW aadR/crp/fnr htrA ssb traR/dksA tldD ftsK nifS folC ctrA ftsH tilS tolB ruvB ruvA ruvC secA argJ hisA hisF parB parA gidB gidA trmE/thdF
NC_021150 dnaA recF gyrB hdtS glyS glyQ trkA sun fmt def lysM dprA yrdC qor aroE pcoA pcoB cysC2 uvrD polA thrB gadh3 gadh1 gadh2 cyoE ctaD purE purK modA2 nifH nifD nifK nifT nifY lrv nifE nifN nifX fesI iscAnif nifU nifS nifV cysE1nif nifW nifZ nifM clpX nifF osmC pagL dsbA cycA engB zwf-4 hexR-3 pycA pycB trpI trpB trpA cbpA clpP aglA-2 vnfY vnfK vnfG vnfD vnfF vnfH pcaK moeB2 vnfX vnfN vnfE vnfA vnfU relA rpoZ gmk rph crc pyrE argB algC dut coaBC radC mtgA thiG trmB rdgB metW metX proC pilT pyrC pyrB pyrR gshB pilG bioA ilvD hipA coaD metF ahcY ligB fucP dctM dctQ dctP dctD2 dctB2 fdhD fdhA fdhH fdhI fdhE moaA mqo glk gluP hoxY hoxH hoxW cooJ cooT cooC cooS cooF yggX mutY hisB hisH hisA hisF pgm-1 secB argE argA tonB gshA bcsZ bcsD bcsC bcsAB ompR envZ hslO pckA cls nudE metK tktA-3 epd pgk fba cccA katG glcB dsbG bioB bioF bioH bioC birA coaX tuf secE nusG rplK rplA rplJ rplL rpoB rpoC rpsL rpsG fusA rpsJ rplC rplD rplW rplB rpsS rplV rpsC rplP rpmC rpsQ rplN rplX rplE rpsN rplR rpsE rpmD rplO secY rpmJ rpsM rpsD rpoA rplQ uvrA oprG-2 oprG-1 nrdR ribD ribE ribB ribH nusB thiL ribA retS purD purH fis prmA accC accB aroQ dipZ speA serB psd rhdA mutL miaA hfq hflX hflK hflC hisZ purA rnr rpsF rpsR rplI dnaB fpr2 tesB trmA bcpB btuB dxs-1 ispA-1 xseB-1 ppa aldA mpl ubiX tonB2 ftsH lrgA lrgB proA nadD rodA mltB dacA lipB lipA benR benA benB benC benD xylJ xylI xylH xylQ xylT xylE xylG xylK dmpI dmpB dmpP dmpO dmpN dmpM dpmL dmpK holA leuS lnt miaB rhaU hpaI aldB hemL thiE thiD ureG ureE dctA-2 dsbB leuD2 leuC2 ureC ureB ureA eutB eutC rluE hrpB cueR selD folX mdcA citG mdcC mdcD mdcG madL madM nhaA fixA fixB fixC fixX tonB3 algA algF algV algI algL algX algG algJ algK alg44 alg8 algD algB coxB coxA moaC moaD moaE rhlB estB apc4 apc3 apc2 apc1 valS pepA lptF lptG mviN ribF ileS lspA ispH fimT pilE pilV pilW pilY1 thiO comL rluD clpB ndh coaE pilD pilC pilB pilA nadC ampD ampE fruR fruB fruK fruA prfC speC lldD gatB gatA gatC mreB mreC mreD maf tldD tldE ptsO ptsN 
rpoN lptB lptA kpsF murA hisG hisD hisC algW cysD cysN trpS nfi rplM rpsI petA petB petC sspA sspB gmhA mraZ mraW ftsL ftsI murE murF mraY murD ftsW murG murC ddl ftsQ ftsA ftsZ lpxC secA argJ apbA ampG fabG pdxH fxsA groES groEL rsuA ung nadB mucA mucB mucC mucD lepA lepB rnc era alyA3 recO pdxJ mmsA lon putA metR metE argF bfr rnt pyrC2 argG mobA moaB moeA ech vdh fcs aat1 etfB1 eftA1 pobA pobR fadB fadA topA lexA psrA mfd gapB nqrA nqrB nqrC nqrD nqrE nqrF apbE sthA lolC lolD lolE comA exbB exbD lpxK kdsB ptpA murB rne rluC rpmF plsX fabD fabG2 acpP fabF pabC tmk holB pilZ mhpB mphC mhpR mhpD mhpF mphE mhpT etfB2 eftA2 pyrF efp ohr hmgA htpX fadE queF trkH ispZ scpA scpB edd-1 glk2 hexR-2 pgl-1 eda-2 gph ubiG mtnA gyrA serC pheA hisC2 aroA cmk rpsA ihfB rfaH rfbB rfbD rfbA rfbC gspD gspG2 gspE cysC1 eexD eexE eexF hrpA fadB2 etfB3 etfA3 zwf-3 pgi zwf-2 pgl-2 pgi2 colS colR ppdK dctB dctD pepN pyrD rlmL dacB htpG glxI rnfE rnfG rnfD rnfC rnfB rnfA metG apbC dcd dinG agaA estC pdxH2 shaAB shaC shaD shaF shaG livG livM dnaX recR cydB cydA cydR hemN ccoS ccoI ccoH ccoG ccoP ccoQ ccoO ccoN acnA tusA tyrB uvrB gltX thrS infC rpmI rplT pheS pheT ihfA dxs-2 ispA-2 xseB-2 idi acxA acxB acxC acxR rhdE cysE3 alc entA entF entB entE csbC csbX vbiC metC ssuD tauD nemA asfC asfB asfA torG rpiB talB-1 tktA-1 rpiA-1 pykA-2 eno-2 xdhC xdhB xdhA dszA gntT gntK-1 feoA feoB feoC exeA greB oprI aroG cysB cysH pabB gntR prpB prpC acnD prpF ppsA rraA oprF nasH nasB nasA narK atsA pfkB nasT nasS acnB lpxH glnS cysS folD tig clpP2 clpX2 pepS16 ppiD phbC phbA phbB phbR phbP phbF folE speE alyA2 flgN flgM flgA flgB flgC flgD flgE flgF flgG flgH flgI flgJ flgK flgL fliR fliQ fliP fliO fliN fliM fliL fliK fliJ fliI fliH fliG fliF fliE fliD fliS fliT flhX flhD flhC motA motB cheA cheW cheR kdpD kdpE kup kdsA glgX treY malQ treZ glgA galU gor gacA uvrC pgsA actP fpvI fpvR groEL2 gapA gcvP2 sdaA gcvT3 hutG hutI hutU phyH hutH hutF nifA2 lysE pvdE ptxS kguE kguK kguT kguD fagA fumC eno-1 pykA-1 
eda-1 zwf-1 hexR-1 edd-2 gntK-2 mtlY tktA-2 talB-2 mtlD mtlK mtlG mtlF mtlE mtlR pgm-2 ybhE cysI nudC motD motC fliC rfbC2 rfbG rfbF hicB flaG fliA flhE flhF flhA flhB cheZ cheY cheB glgB treS glgE cobA serS rarA lolA ftsK trxB aat infA clpA clpS cspD icd mnmA purB nuoA nuoB nuocd nuoE nuoF nuoG nuoH nuoI nuoJ nuoK nuoL nuoM nuoN gspG1 gspF nfuA xth ppiC sixA gpsA fabA fabB metH arsD arsC arsB arsA dnaQ rnhA fabI sucC lpdA sucB sucA sdhB sdhA sdhD sdhC gltA wbpO wbpP phaJ codA hepA ligA zipA smc moaA2 cycH ccmH ccmG ccmF ccmE ccmD ccmC ccmB ccmA fleN sodA ssuB ssuC ssuA ssuE cysP lapQ lapG lapF lapI lapH lapE lapC arsE lapP lapO lapN lapM lapL lapK lapB lapR ttdB ttdA dctA-3 dszA2 cas2 cas1 cas4 csd2 csd1 cas5d oprE cysT cysW cysA alyA1 gcvT2 hpaR talB-3 cyaB recQ cysK rnd cobS cobC cobT cobU cobQ cobD cbiA btuB2 aroF dctA-1 pcpS nrdA nrdB vnfA2 algE5 mexT mexE mexF ppnK metZ purF folC accD trpF truA asd leuB leuD leuC dusC fleQ algZ mgtE csrA lysC alaS argD acsA1 pta-1 ackA-1 phhA phhB ggpS hppD fumC2 mnmC asnB actP2 htrB minC minD minE rluA rdgC purC dapA nadA frdC trbG trbF trbL trbJ trbE trbD trbC trbB traF oprL tolB tolA tolR tolQ ybgC ruvB ruvA ruvC aspS proS pgsA2 purM purN relB relE1 mazG ilvM rumA cysM gacS dinP tmp lysB rimO pfpI cls2 sodB araJ aroQ2 pip hppD2 pcaC pcaD pcaB pcaF pcaJ pcaI pcaR pcaG pcaH pcaQ purU mvaT sbcB pykA-6 fumB fpr1 finR recX recA ibpB ibpA mutS fdxA aerP rpoS nlpD pcm surE truD ispF fghA ispD ftsB eno kdsA2 pyrG metJ accA dnaE rnhB lpxB lpxA fabZ lpxD ompH mucP dxr csdA uppS frr pyrH tsf rpsB map glnD dapC dapD thiF dapE rrmA cspA yeaZ adk ppc lysS prfB recJ yaeQ thrC hom dsbC xerD rplS trmD rimM rpsP ffh purT purL guaA xseA acoK gcd pvdH fpvB engA hisS ispG pilF ndk iscX fdx hscA hscB iscA iscU iscS iscR cysE2 trmH suhB secF secD yajC tgt queA rpsT proB obgE rpmA rplU gerC fklB fpvA glyA cobW cstA radA mscL ackA-2 pta-2 mqo2 hpt upp hemH phr murI prfA hemA lolB ipk pth ychF pqqF pqqE pqqD pqqC pqqB acoR acoA acoB acoC adh otsA 
otsB rimI leuA pqiA pssA ilvC ilvH ilvI mrcB hmuV sfsA aspT dksA cbrA cbrB pcnB folK panB panC panD pgi3 acsA2 pnp rpsO truB rbfA infB nusA secG glmM folP ftsH1 rrmJ greA carB carA dapB dnaJ dnaK grpE recN fur omlA smpB lctP mosA mosB maeB2 lldp glcB2 glcG glcF glcE glcD glcC eno-3 pykA-3 cdaR yhaD gcl garR pirA parC parE sfnG seuB ribAB livH seuA metB msuD sfnR pta-3 ackA-3 thiC rfaE msbA galE waaP waaG ilvE glnE aceE aceF msrA relE2 hemE gltD gltB aroB aroK pilQ pilP pilO pilN pilM ponA maeB rpmE priA argS hslV hslU ubiE ubiB hisI hisE tatA tatB tatC mdoH2 mdoD mdoH1 mdoG dtd fbp glpF glpK glpR glpD eno-4 tpiA-2 typA thiI glnA ntrB ntrC ahpC bioD tyrS anmK erpA argC coq7 speD crp trpC trpD trpGD pdhB pdhA cadR pncA mdaB ada hpaI2 trpE gph2 rpe gabT lptD pdxA ksgA apaG apaH glpE prkA cca folK1 folB gcp rpsU dnaG rpoD cooA katN vnfA3 fecA gcvP1 gcvH gcvT ubiF ubiH ubiD rho trxA ppx ppk hemB feoB2 algQ dsbB2 hemD hemC algR argH corA cyaY lppL lysA dapF xerC amtB glnK pirR rep xpt cycB dadX dadA lrp aldH rpmG rpmB rpoH ftsX ftsE ftsY glpQ thyA lgt ptsP nudH ilvA rpiA-2 serA recG muc26 hupA ubiC ubiA phoB phoR phoU pstB pstA pstC pstS potI potH potG potF spuC spuB spuA anfR anfO anfK anfG anfD anfH anfA anfU gabD algY tctE tctD tctC tctB tctA fic tpiA-3 tpiA-1 eno-5 pykA-5 iolC iolE iolB iolT iolA iolD idh iolH hppD3 fahA maiA nafY rnfH rnfE1 rnfG1 rnfD1 rnfC1 rnfB1 rnfA1 nifL nifA nifB nifO nifQ norR hmp algE3 algE2 algE1 algE4 algE6 algE7 dgoT galD dgoA dgoK dgoR pykA-4 melA aglA-1 scrY-2 hbdH bktA lpxO recD recB recC nhaA2 scrY-1 lacY scrB scrR glmS glmR glmU atpC atpD atpG atpA atpH atpF atpE atpB atpI parB parA gidB gidA trmE rnpA rpmH
Say we have two FASTA files that contain some duplicate sequences (exact match) but with different IDs.
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTAGTAGTC
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCGAGTAGTC
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC
[uzi@quince-srv2 ~/oneliners$ cat f2.fasta
>C1_F4
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>C2_F5
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC
>C3
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>C4
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCGGGTAGTC
>C5
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTGGTAGTC
We can use the following one-liner to list the unique sequences, one per line, with the IDs of duplicates given as a comma-separated list:
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }'
F3
F4_C1,C1_F4
C4
C3
F2
C5
F5_C2,C2_F5
F1
In the above one-liner, the first awk statement linearizes the FASTA file so that each header and its sequence alternate on separate lines. To identify duplicates only, just add an extra awk statement at the end:
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }' | awk -F, 'NF>1'
F4_C1,C1_F4
F5_C2,C2_F5
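The linearization step is worth seeing in isolation. A minimal sketch on a made-up toy.fasta (the leading blank line it emits is the reason the second awk statement starts at NR>1):

```shell
# toy.fasta is a made-up example file with a multi-line sequence
cat > toy.fasta <<'EOF'
>A
ACGT
ACGT
>B
TTTT
EOF
# Join each wrapped sequence onto one line, alternating header/sequence
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' toy.fasta
```

The multi-line sequence under >A comes out as the single line ACGTACGT, preceded by one blank line at the top of the output.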
Say the data is in the following format:
[uzi@quince-srv2 ~/oneliners$ cat test.txt
SAMPLE MIR ABUNDANCE
sample1 mir1 30
sample1 mir3 100
sample1 mir4 120
sample2 mir1 40
sample2 mir2 200
sample3 mir1 190
sample3 mir1 400
sample4 mir4 20
sample5 mir1 19
You can use the following one-liner to convert the above data into tab-delimited abundance table:
[uzi@quince-srv2 ~/oneliners$ perl -ane 'if ($. > 1){$r{$F[0].":".$F[1]}=$F[2];unless($F[0]~~@s){push @s,$F[0];}unless($F[1]~~@m){push @m,$F[1];}}END{print "Samples\t".join("\t",@s)."\n";for($i=0;$i<@m;$i++){print $m[$i];for($j=0;$j<@s;$j++){(not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}print "\n";}}' test.txt
Samples sample1 sample2 sample3 sample4 sample5
mir1 30 40 400 0 19
mir3 100 0 0 0 0
mir4 120 0 0 20 0
mir2 0 200 0 0 0
We can also dereplicate the reads from FASTQ files and generate a sorted list of reads in usearch header format as a FASTA file:
Single-end FASTQ file
[uzi@quince-srv2 ~/oneliners$ perl -MDigest::MD5=md5_hex -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){chomp($a[1]);$d{uc($a[1])}++;} END { foreach my $n (sort{ $d{$b} <=> $d{$a} } keys %d){print ">".md5_hex($n).";size=".$d{$n}.";\n".$n."\n"}}' forward.fastq > forward_derep.fasta
[uzi@quince-srv2 ~/oneliners$ head forward_derep.fasta
>55961afb4a0df6979393aa7f2ad02cac;size=340;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>c5a3043fda41c9b4f972826c3bba8154;size=263;
CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>12071fab7db0b36abcc540059f9bb78b;size=119;
CCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAG
>76d3994572a1c37336f2b4dd897434a7;size=107;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGAAGACTGTCCTAAGGATTGTAAACTTCTTTTATACGGGAATAACGGGCGATACGAGTATTGCATTGAATGTACCGTAAGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTTGTTCGGTA
>c6e34aa95f6b068fcd12e3308e92815a;size=104;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTA
Paired-end FASTQ files
[uzi@quince-srv2 ~/oneliners$ paste forward.fastq reverse.fastq | perl -MDigest::MD5=md5_hex -ne 'chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){@q=split("\t",$a[1]);$d{uc($q[0].":".$q[1])}++;} END {open(FF,">","forward_derep.fasta");open(RR,">","reverse_derep.fasta");foreach my $n (sort{ $d{$b} <=> $d{$a} } keys %d){$l=md5_hex($n);@r=split(":",$n);print FF ">".$l." 1;size=".$d{$n}.";\n".$r[0]."\n"; print RR ">".$l." 2;size=".$d{$n}.";\n".$r[1]."\n"; }}'
[uzi@quince-srv2 ~/oneliners$ head *_derep.fasta
==> forward_derep.fasta <==
>948b3e6c0d9c5140a20b3e950baf513b 1;size=45;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>84ee339f8f7d47fb5278cc7d211c2856 1;size=32;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>087be525fc7401eb8dc965c0b0b20a83 1;size=30;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>5046bcb4cbabfb76f0f6966c0ada2695 1;size=30;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>792f79536de130a5e6ded001bbe8a191 1;size=28;
CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
==> reverse_derep.fasta <==
>948b3e6c0d9c5140a20b3e950baf513b 2;size=45;
GACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>84ee339f8f7d47fb5278cc7d211c2856 2;size=32;
GACTACCCGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>087be525fc7401eb8dc965c0b0b20a83 2;size=30;
GACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>5046bcb4cbabfb76f0f6966c0ada2695 2;size=30;
GACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>792f79536de130a5e6ded001bbe8a191 2;size=28;
GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
Here, we are trying to emulate the functionality of pandaseq using EMBOSS water (an implementation of the Smith-Waterman local alignment algorithm):
[uzi@quince-srv2 ~/oneliners$ paste forward.fastq reverse.fastq | perl -ne 'chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){@q=split("\t",$a[1]); $q[1]=join("",map{$_ =~ tr/ACGT/TGCA/; $_} reverse split("",$q[1])) ;$r=qx/water -asequence=asis:$q[0] -bsequence=asis:$q[1] -gapopen=50 -gapextend=0.5 -stdout -auto -aformat3 markx1/;$r=~/\s*asis (.*)\n\s*asis (.*)/;if ($1 ne ""){@h=split(" ",$a[0]);$m_f=$1;$m_r=$2;$q[0]=~/(.*)$m_f/;$r_l=$1;$q[1]=~/$m_r(.*)/;$r_r=$1;print ">".substr($h[0],1)."\n".$r_l.$m_f.$r_r."\n"}}' > merged.fasta
This one-liner is useful for assembling paired-end reads for 16S rRNA amplicons where the primers are designed so that there is a substantial overlap between the paired-end reads. The above one-liner could be improved further. I am using a very high gap open penalty, which gives me only one local alignment spanning a single alignment record in the output file generated by water. If I make it smaller, my regular expressions will fail, as the alignments output by water will span multiple lines (I'll update it later). Moreover, I am not taking the consensus of the overlapped region, but rather keeping the region from the forward read, on the view that reverse reads generally have lower quality than forward reads.
For comparative purposes, I have generated merged_pandaseq.fasta from pandaseq:
pandaseq -f forward.fastq -r reverse.fastq -o 10 > merged_pandaseq.fasta
where I have used -o 10, -o 20, and -o 50, respectively. I then used bowtie2 to map the reads against the reference database:
bowtie2-build ref.fasta ref
bowtie2 -x ref -f reads.fasta > reads.sam
and generated the following table:
Algorithms              | Perl one-liner (gapopen=50,gapextend=0.5) | pandaseq (o=10) | pandaseq (o=20) | pandaseq (o=50)
Total Reads             | 10000                                     | 10000           | 10000           | 10000
Overlapped Reads        | 7197                                      | 9753            | 9592            | 3440
No Alignment            | 34 (0.47%)                                | 314 (3.22%)     | 273 (2.85%)     | 23 (0.67%)
Aligned exactly 1 times | 684 (9.50%)                               | 920 (9.43%)     | 903 (9.41%)     | 263 (7.65%)
Aligned > 1 times       | 6479 (90.02%)                             | 8519 (87.35%)   | 8416 (87.74%)   | 3154 (91.69%)
Overall alignment rate  | 99.53%                                    | 96.78%          | 97.15%          | 99.33%
% Overlapped reads      | 71.97%                                    | 97.53%          | 95.92%          | 34.4%
It can be seen that my one-liner hasn't done a bad job: I get around 72% overlapped reads, of which 99.53% align to the reference. pandaseq, on the other hand, is sensitive to the minimum overlap parameter.
These taxonomic classification programs are frequently used in metagenomic analysis to generate taxonomic assignments for small-subunit ribosomal RNA (SSU rRNA) sequences. They rely on the
lowest common ancestor (LCA) based taxonomic classification method, searching sequences against public
databases, assigning them to the most similar taxa in a given taxonomy, and generating the abundance tables.
Abundance table from CREST:
The following two commands will produce an assignment file named input_file_Assignments.txt for input_file.fa:
$ megablast -i input_file.fa -b 50 -v 50 -m 7 -d $PATH_TO_CREST_DB/silvamod.fasta -a 10 -o input_file.xml
$ classify -i input_file.fa -a -p input_file.xml
To extract an abundance table and the corresponding tree from CREST, the data should first be organized to follow the folder structure shown below.
This organisation makes it easier to run repetitive functionality on the subfolders, each of which contains an assignment file for one sample, with the folder name serving as the sample name that will appear in the abundance table. Moreover, with this structure we can use the collateResults.pl utility, which searches subfolders for files with names matching a certain pattern, extracts the quantitative information stored in them in a delimited format, and then merges the information from all the subfolders together.
The following bash one-liner will produce an abundance table (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:
[uzi@quince-srv2 ~/oneliners/CREST/Main/S1]$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk
The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:
[uzi@quince-srv2 ~/oneliners/CREST/Main]$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
[uzi@quince-srv2 ~/oneliners/CREST/Main]$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv
The above bash one-liner generates a test.csv file in each subfolder, which is then used in the second command to collate the results together to produce collated.csv.
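The core of the awk statement above, truncating a ;-delimited taxonomy path to the first l levels, can be sketched in isolation (a simplified sketch that ignores the size= counts and unknown-taxon handling):

```shell
# Keep only the first l=3 levels of a made-up taxonomy path
echo "Cellular_organisms;Bacteria;Proteobacteria;Gammaproteobacteria;" \
  | awk -F";" -v l=3 '{s="";for(i=1;i<=l;i++)s=s $i ";";print s}'
```

With l=3 the path is cut at Phylum level; raising l to 4, 5, 6, or 7 keeps Class, Order, Family, or Genus, matching the levels described below.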
Abundance table from RDP: The following command will produce an output file named input_file_Assignments.txt for input_file.fa:
$ java -Xmx1g -jar classifier.jar classify -f filterbyconf -o input_file_Assignments.txt input_file.fa
We follow the same folder structure as before. The bash one-liners for RDP are slightly different owing to the different format of the assignment files; however, they follow the same philosophy. The following bash one-liner will produce an abundance table (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:
[uzi@quince-srv2 ~/oneliners/RDP/Main/S1]$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk
The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:
[uzi@quince-srv2 ~/oneliners/RDP/Main]$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
[uzi@quince-srv2 ~/oneliners/RDP/Main]$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv
For producing trees at other taxonomic levels, in the first awk statement of the above bash one-liners use l=4 for Class, l=5 for Order, l=6 for Family, and l=7 for Genus, respectively. These bash one-liners will work even if we run the classifiers on dereplicated reads in the usearch header format, i.e., containing the string size=.* in the FASTA headers. If the original dereplicated sequences are in a different format, we first convert them to the usearch header format before running the classifiers and the above bash one-liners. For example, if we have generated the denoised and dereplicated reads from AmpliconNoise, then the following bash one-liner can be used to convert the FASTA headers:
$ awk '/>/{$0=gensub("(.*?)_(.*?)$","\\1;size=\\2;","g",$0)}1' dereplicated.fa > dereplicated_usearch_format.fa
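Since gensub is a GNU awk extension, a sed equivalent can be sketched for other systems, assuming (as above) that the trailing _N in each AmpliconNoise header is the abundance count:

```shell
# Made-up example header; the real input would be dereplicated.fa
echo ">Sample1_25" | sed -E 's/^>(.*)_([0-9]+)$/>\1;size=\2;/'
```

The greedy (.*) stops at the last underscore, so headers containing underscores elsewhere (e.g. >S_1_25) are still split on the final _N.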
Note: Delete the test.csv file from the individual folders every time you run the bash one-liners for multiple samples (say, for a different value of l).
These bash one-liners rely on the
Path2Newick.java utility to convert the extracted paths to a tree in the Newick format. Both the abundance table and the tree can then be used in
PHYLOmap.py to generate a heatmap. The working of Path2Newick and an example heatmap generated from PHYLOmap are given below
Step 1: We first get the average quality score and flow cell information and store them as DAT files, i.e., redirect the output to FILE.dat as in the following example
perl -ne 'chomp;$_=~m/@(.*?):(.*?):(.*?):(.*?):(.*?):(.*?):(.*?)\s/;print $4.",",$5.",".$6.",".$7;<>;<>;$_=<>;chomp;map{$s+=ord($_)-33}split("");print",".($s/length($_))."\n";$s=0;' < FILE.fastq > FILE.dat
This one-liner gives:
Column 1: Flow cell lane
Column 2: Tile number within the flowcell lane
Column 3: x-coordinate of the cluster within the tile
Column 4: y-coordinate of the cluster within the tile
Column 5: Average quality score
For example,
[uzi@quince-srv2 ~/oneliners]$ perl -ne 'chomp;$_=~m/@(.*?):(.*?):(.*?):(.*?):(.*?):(.*?):(.*?)\s/;print $4.",",$5.",".$6.",".$7;<>;<>;$_=<>;chomp;map{$s+=ord($_)-33}split("");print",".($s/length($_))."\n";$s=0;' < 5_TAAGGCGA-TAGATCGC_L001_R2_001.fastq | head -10
1,1101,9593,26322,26.4814814814815
1,1101,7952,34132,27.6470588235294
1,1101,8189,100645,26.3333333333333
1,1102,10633,88138,25.8048780487805
1,1103,5153,25465,36.7029702970297
1,1103,9533,58452,27.4752475247525
1,1103,21248,64336,28.1842105263158
1,1104,19815,7437,23.0769230769231
1,1105,14548,7890,30.4653465346535
1,1105,8795,73701,21.3076923076923
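Though not part of the original recipe, a per-tile summary (mean of column 5 grouped by the tile number in column 2) can be read straight off a DAT file; the three records below are made up, so on real data replace the printf with "cat R1.dat":

```shell
# Mean quality per tile (column 2 = tile, column 5 = average read quality).
printf '1,1101,10,20,20\n1,1101,30,40,30\n1,1102,50,60,10\n' |
awk -F"," '{s[$2]+=$5; n[$2]++} END{for(t in s) printf "%s,%.2f\n", t, s[t]/n[t]}' |
sort
# 1101,25.00
# 1102,10.00
```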
Step 2: Having generated the individual dat files (*_R1.dat and *_R2.dat), collate the forward reads and reverse reads separately to generate R1.dat and R2.dat, respectively.
[uzi@quince-srv2 ~/oneliners]$ cat *_R1.dat > R1.dat
[uzi@quince-srv2 ~/oneliners]$ cat *_R2.dat > R2.dat
Step 3: Find the unique tile numbers in both R1.dat and R2.dat
[uzi@quince-srv2 ~/oneliners]$ awk -F"," '{print $2}' R1.dat | sort | uniq > R1.uniq
[uzi@quince-srv2 ~/oneliners]$ awk -F"," '{print $2}' R2.dat | sort | uniq > R2.uniq
Step 4: Extract the data for each tile into a separate folder and save it as a space-separated text file (as required by gnuplot). This will create TXT files 1101.txt, 1102.txt, and so on.
[uzi@quince-srv2 ~/oneliners]$ for i in $(cat R1.uniq); do awk -F"," -v pattern=$i '$2==pattern{gsub(","," ",$0);print $0}' R1.dat > R1/$i.txt; done
[uzi@quince-srv2 ~/oneliners]$ for i in $(cat R2.uniq); do awk -F"," -v pattern=$i '$2==pattern{gsub(","," ",$0);print $0}' R2.dat > R2/$i.txt; done
Step 5: Make a file gnuscript.txt with the following contents and place it in both the R1 and R2 folders. This script, when called, generates a scatterplot as a *.txt.ps file
PATH=$0
DATAFILE=$1
set palette
plot PATH."/".DATAFILE using 3:4:5 with dots palette notitle
set xlabel "X"
set ylabel "Y"
set term postscript color
set output DATAFILE.'.ps'
replot
set term x11
Step 6: Call the script in the individual folders and use ImageMagick to convert the *.txt.ps files into *.txt.jpg; for example, in the R1 folder:
[uzi@quince-srv2 ~/oneliners/R1]$ for i in $(ls ????.txt); do echo processing $i now;echo "call \"gnuscript.txt\" \"'.'\" \"'$i'\"" | gnuplot; convert $i.ps $i.jpg; done
Step 7: To get the total number of reads for each tile, use
[uzi@quince-srv2 ~/oneliners/R1]$ awk -F"," '{print $2}' R1.dat | sort | uniq -c
[uzi@quince-srv2 ~/oneliners/R2]$ awk -F"," '{print $2}' R2.dat | sort | uniq -c
Note: I have used the above one-liners to assess the read qualities between two runs (Run1 and Run2), where the tiles of Run1 are less densely clustered (roughly 1/7 the density of Run2), with 28 tiles in total. As a result of the high density, I was getting fewer overlapping reads (assembled from forward and reverse reads) for Run2. In the plots generated in Step 6, reads with lighter shades correspond to higher PHRED quality scores.
The following observations were then made from the results given below: in general, forward reads are of better quality than reverse reads; noisy reads show some spatial trend, i.e., reads near the edges are noisier; data is missing around the edges, presumably reads filtered out by QC before processing; and there are rectangular patches that are more prominent for Run1, with the average read qualities in some patches differing.
Moreover, in the plot for tile 1114 of Run2, one can notice a purple artifact in both forward and reverse reads. Similarly, for tile 2114, the reverse reads are worse than in the other tiles.
[Per-tile quality scatterplots: 1101.jpg to 1113.jpg (Run1) and 2101.jpg to 2113.jpg (Run2)]
We can use reservoir sampling to subsample FASTA and FASTQ files. I have the following perl one-liners (though I'd prefer an awk one-liner in this case) to subsample paired-end FASTQ files (the first one generates the initial random seed from the compressed output of the "ps axww" command):
[uzi@quince-srv2 ~/oneliners]$ paste forward.fastq reverse.fastq | perl -ne 'BEGIN{$k=10;srand (time ^ $$ ^ unpack "%L*", `ps axww | gzip`)}chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){$s= ($x++<$k) ? $x-1:int(rand()*$x);if($s<$k){$R{$s}=join("\t",@a)}} END { open(FR,">","forward_sub.fastq");open(RR,">","reverse_sub.fastq");foreach(keys%R){@a=split("\t",$R{$_});print FR join("\n",@a[grep { !($_ & 1) } 0 .. $#a])."\n"; print RR join("\n",@a[grep { ($_ & 1) } 0 .. $#a])."\n"}}'
[uzi@quince-srv2 ~/oneliners]$ paste forward.fastq reverse.fastq | perl -ne 'BEGIN{$k=10;srand (time ^ $$)}chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){$s= ($x++<$k) ? $x-1:int(rand()*$x);if($s<$k){$R{$s}=join("\t",@a)}} END { open(FR,">","forward_sub.fastq");open(RR,">","reverse_sub.fastq");foreach(keys%R){@a=split("\t",$R{$_});print FR join("\n",@a[0,2,4,6])."\n"; print RR join("\n",@a[1,3,5,7])."\n"}}'
In the above one-liners, you can change $k to the number you want to downsample your reads to.
Reservoir sampling: a simple random sampling strategy to produce a sample without replacement from a stream of data in one pass, i.e., in O(n) time. We want to sample s instances - uniformly at random, without replacement - from a population of n records, where n is not known in advance.
Figuring out n first would require two passes; reservoir sampling achieves this in one.
A reservoir R here is simply an array of size s. Let D be a data stream of size n; the algorithm is then as follows:
Store the first s elements in R.
For each element at position k = s+1 to n,
accept it with probability s/k;
if accepted, replace a randomly chosen element of R with it.
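A minimal standalone sketch of this algorithm, sampling k=5 lines from a 100-line stream with the same awk logic used in the scripts that follow:

```shell
# Reservoir sampling: keep a uniform random sample of k lines from a stream
# in one pass.
seq 1 100 |
awk -v k=5 'BEGIN{srand()}
  {s=(x++<k)?x-1:int(rand()*x);  # record number x is accepted with probability k/x
   if(s<k)R[s]=$0}               # on acceptance, overwrite a random reservoir slot
  END{for(i in R)print R[i]}'
```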
Using reservoir sampling, we have written several bash scripts that allow us to subsample single-end and paired-end fasta as well as fastq files. The content of these scripts are as follows:
subsample_se_fasta.sh:
Example usage: ./subsample_se_fasta.sh s.fasta s_sub.fasta 100000
#!/bin/bash
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $1 | awk 'NR>1{ printf("%s",$0); n++; if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$3 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$2" '{print $1"\n"$2 > f}'
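As an aside, the first awk stage in this script merely linearizes wrapped FASTA (header on one line, the whole sequence on the next), which can be seen on a made-up two-record file:

```shell
# Linearize wrapped FASTA: join sequence lines, one record per two lines.
# (seq1/seq2 are toy records; the trailing "awk 'NR>1'" drops the leading blank line.)
printf '>seq1\nACGT\nTTAA\n>seq2\nGGCC\n' |
awk '/^>/ {printf("\n%s\n",$0);next;} {printf("%s",$0);} END{printf("\n");}' |
awk 'NR>1'
# >seq1
# ACGTTTAA
# >seq2
# GGCC
```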
subsample_se_fastq.sh:
Example usage: ./subsample_se_fastq.sh s.fastq s_sub.fastq 100000
#!/bin/bash
cat $1 | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$3 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$2" '{print $1"\n"$2"\n"$3"\n"$4 > f}'
subsample_pe_fasta.sh:
Example usage: ./subsample_pe_fasta.sh f.fasta r.fasta f_sub.fasta r_sub.fasta 100000
#!/bin/bash
paste <(awk '/^>/ {printf("\n%s\n",$0);next; } {printf("%s",$0);} END {printf("\n");}' < $1) <(awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < $2) | awk 'NR>1{ printf("%s",$0); n++; if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$5 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$3" -v r="$4" '{print $1"\n"$3 > f;print $2"\n"$4 > r}'
subsample_pe_fastq.sh:
Example usage: ./subsample_pe_fastq.sh f.fastq r.fastq f_sub.fastq r_sub.fastq 100000
#!/bin/bash
paste $1 $2 | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$5 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$3" -v r="$4" '{print $1"\n"$3"\n"$5"\n"$7 > f;print $2"\n"$4"\n"$6"\n"$8 > r}'
Here I am using an example dataset to identify the list of genomes with low coverage. It is assumed that BWA, Samtools, and Bedtools are installed. Here are the steps:
Step 1: Index the reference database file (my reference database comprises 59 genomes)
[uzi@quince-srv2 ~/oneliners]$ bwa index ref.fasta
Step 2: Use BWA-MEM to align the paired-end sequences. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending the seeds with the affine-gap Smith-Waterman (SW) algorithm
[uzi@quince-srv2 ~/oneliners]$ bwa mem ref.fasta forward.fastq reverse.fastq > aln-pe.sam
Step 3: Convert sam file to binary bam file
[uzi@quince-srv2 ~/oneliners]$ samtools view -h -b -S aln-pe.sam > aln-pe.bam
Step 4: Extract only those sequences that were mapped against the reference database. Use -F 4 switch.
[uzi@quince-srv2 ~/oneliners]$ samtools view -b -F 4 aln-pe.bam > aln-pe.mapped.bam
Step 5: Generate a file lengths.genome that contains two entries per row: the genome identifier and the genome length.
[uzi@quince-srv2 ~/oneliners]$ samtools view -H aln-pe.mapped.bam | perl -ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome
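The two chained perl filters above can also be collapsed into a single awk pass that tolerates any @SQ tag order; a sketch on a literal header line (the name and length here are illustrative stand-ins):

```shell
# Extract "name<TAB>length" from @SQ header lines, whatever the tag order.
printf '@SQ\tSN:Cdiff630\tLN:4290252\n@PG\tID:bwa\n' |
awk -F"\t" '$1=="@SQ"{name=len="";
  for(i=2;i<=NF;i++){if($i~/^SN:/)name=substr($i,4); if($i~/^LN:/)len=substr($i,4)}
  print name"\t"len}'
# Cdiff630	4290252
```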
Step 6: Sort BAM file. Many of the downstream analysis programs that use BAM files actually require a sorted BAM file. -m specifies the maximum memory to use, and can be changed to fit your system.
[uzi@quince-srv2 ~/oneliners]$ samtools sort -m 1000000000 aln-pe.mapped.bam aln-pe.mapped.sorted
Step 7: We will now use bedtools, a very useful suite of programs for working with SAM/BAM, BED, VCF, and GFF files, which you will encounter many times doing NGS analysis. The -ibam switch takes the sorted BAM file we generated earlier, -d reports the depth at each genome position with 1-based coordinates, and -g takes the genome lengths file we generated earlier
[uzi@quince-srv2 ~/oneliners]$ bedtools genomecov -ibam aln-pe.mapped.sorted.bam -d -g lengths.genome > aln-pe.mapped.bam.perbase.cov
Step 8: Look at the first few entries in the file generated above. First column is genome identifier, second column is position on genome, third column is coverage.
[uzi@quince-srv2 ~/oneliners]$ head aln-pe.mapped.bam.perbase.cov
Acidobacterium_capsulatum_ATCC_51196 1 8
Acidobacterium_capsulatum_ATCC_51196 2 8
Acidobacterium_capsulatum_ATCC_51196 3 8
Acidobacterium_capsulatum_ATCC_51196 4 9
Acidobacterium_capsulatum_ATCC_51196 5 9
Acidobacterium_capsulatum_ATCC_51196 6 9
Acidobacterium_capsulatum_ATCC_51196 7 9
Acidobacterium_capsulatum_ATCC_51196 8 9
Acidobacterium_capsulatum_ATCC_51196 9 9
Acidobacterium_capsulatum_ATCC_51196 10 12
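As an optional extra, not part of the original pipeline, the same per-base file also yields mean depth and covered fraction per genome; the toy tab-separated records below stand in for the real file:

```shell
# Per-genome summary of the per-base coverage file
# (column 1 = genome, column 3 = depth).
printf 'gA\t1\t8\ngA\t2\t10\ngA\t3\t0\ngB\t1\t4\n' |
awk -F"\t" '{n[$1]++; s[$1]+=$3; if($3>0)c[$1]++}
  END{for(g in n) printf "%s\tmean=%.2f\tcovered=%d/%d\n", g, s[g]/n[g], c[g], n[g]}' |
sort
# gA	mean=6.00	covered=2/3
# gB	mean=4.00	covered=1/1
```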
Step 9: Now we will count only those positions where we have >0 coverage.
[uzi@quince-srv2 ~/oneliners]$ awk -F"\t" '$3>0{print $1}' aln-pe.mapped.bam.perbase.cov | sort | uniq -c > aln-pe.mapped.bam.perbase.count
Step 10: To see what we have done, use the cat command
[uzi@quince-srv2 ~/oneliners]$ cat aln-pe.mapped.bam.perbase.count
4126255 Acidobacterium_capsulatum_ATCC_51196
2664070 Akkermansia_muciniphila_ATCC_BAA-835
2176909 Archaeoglobus_fulgidus_DSM_4304
6259688 Bacteroides_thetaiotaomicron_VPI-5482
5162898 Bacteroides_vulgatus_ATCC_8482
5334732 Bordetella_bronchiseptica_strain_RB50
6779108 Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence
2968320 Caldisaccharolyticus_DSM_8903
2759841 Chlorobiumlimicola_DSM_245
3133587 Chlorobiumphaeobacteroides_DSM_266
1966851 Chlorobiumphaeovibrioides_DSM_265
2154672 Chlorobiumtepidum_TLS
5248942 Chloroflexus_aurantiacus_J-10-fl
3841892 Clostridium_thermocellum_ATCC_27405
3048801 Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence
2796764 Desulfovibrio_piger_ATCC_29098
1819007 Dictyoglomus_turgidum_DSM_6724
2520796 Enterococcus_faecalis_V583
1458832 Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586
4636091 Gemmatimonas_aurantiaca_T-27_DNA
77854 gi|220903286|ref|NC_011883.1|
2909411 gi|222528057|ref|NC_012034.1|
4920011 gi|307128764|ref|NC_014500.1|
1739058 gi|55979969|ref|NC_006461.1|
7919 gi|83591340|ref|NC_007643.1|
6345359 Herpetosiphon_aurantiacus_ATCC_23779
1558912 Hydrogenobaculum_sp._Y04AAS1
1278529 Ignicoccus_hospitalis_KIN4/I
4901664 Leptothrix_cholodnii_SP-6
1634756 Methanocaldococcus_jannaschii_DSM_2661
1770072 Methanococcus_maripaludis_C5
1650606 Methanococcus_maripaludis_strain_S2,_complete_sequence
459768 Nanoarchaeum_equitans_Kin4-M
2798863 Nitrosomonas_europaea_ATCC_19718
6409848 Nostoc_sp._PCC_7120_DNA
3017983 Pelodictyon_phaeoclathratiforme_BU-1
1929147 Persephonella_marina_EX-H1
2354658 Porphyromonas_gingivalis_ATCC_33277_DNA
2219120 Pyrobaculum_aerophilum_str._IM2
1998126 Pyrobaculum_calidifontis_JCM_11548
1738110 Pyrococcus_horikoshii_OT3_DNA
7145136 Rhodopirellula_baltica_SH_1_complete_genome
4106385 Ruegeria_pomeroyi_DSS-3
5762894 Salinispora_arenicola_CNS-205
5168083 Salinispora_tropica_CNB-440
5229615 Shewanella_baltica_OS185
5145817 Shewanella_baltica_OS223,
3478897 Sulfitobacter_NAS-14.1_scf_1099451320477_
3029105 Sulfitobacter_sp._EE-36_scf_1099451318008_
2666279 Sulfolobus_tokodaii
1520314 Sulfurihydrogenibium_yellowstonense_SS-5
1828687 SulfuriYO3AOP1
2361532 Thermoanaerobacter_pseudethanolicus_ATCC_33223
1884531 Thermotoga_neapolitana_DSM_4359
1823470 Thermotoga_petrophila_RKU-1
1877693 Thermotoga_sp._RQ2
2829658 Treponema_denticola_ATCC_35405
2274638 Treponema_vincentii_ATCC_35580_NZACYH00000000.1
2052189 Zymomonas_mobilis_subsp._mobilis_ZM4
Step 11: We will now use the above file together with lengths.genome to calculate, for each genome, the proportion of its positions covered, using the following one-liner. It reads lengths.genome line by line, assigns the genome identifier to myArray[0] and its length to myArray[1], then looks the identifier up in aln-pe.mapped.bam.perbase.count, extracts the base count, and uses bc to calculate the proportion.
[uzi@quince-srv2 ~/oneliners]$ while IFS=$'\t' read -r -a myArray; do echo -e "${myArray[0]},$( echo "scale=5;"$(awk -v pattern="${myArray[0]}" '$2==pattern{print $1}' aln-pe.mapped.bam.perbase.count)"/"${myArray[1]} | bc ) "; done < lengths.genome > aln-pe.mapped.bam.genomeproportion
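A one-pass awk join can sketch the same calculation without the shell loop and bc; the demo file names and values below are made up:

```shell
# File 1 maps genome -> length (tab-separated);
# file 2 is "uniq -c" output (count, then genome name).
printf 'gA\t100\ngB\t200\n' > /tmp/lengths.demo
printf '  80 gA\n 100 gB\n' > /tmp/count.demo
awk 'NR==FNR{len[$1]=$2; next} {cnt[$2]=$1}
  END{for(g in len) printf "%s,%.5f\n", g, cnt[g]/len[g]}' \
  /tmp/lengths.demo /tmp/count.demo | sort
# gA,0.80000
# gB,0.50000
```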
Step 12: Some of the reference genomes were downloaded from NCBI. We will use a file IDs_metagenomes2.txt that contains the meaningful mapping from accession numbers to genome names.
[uzi@quince-srv2 ~/oneliners]$ cat IDs_metagenomes2.txt
gi|220903286|ref|NC_011883.1|,Desulfovibrio_desulfuricans
gi|222528057|ref|NC_012034.1|,Caldicellulosiruptor_bescii
gi|307128764|ref|NC_014500.1|,Dickeya_dadantii
gi|55979969|ref|NC_006461.1|,Thermus_thermophilus
gi|83591340|ref|NC_007643.1|,Rhodospirillum_rubrum
Step 13: Download my annotation script convIDs.pl and then use IDs_metagenomes2.txt to annotate column 1.
[uzi@quince-srv2 ~/oneliners]$ perl convIDs.pl -i aln-pe.mapped.bam.genomeproportion -l IDs_metagenomes2.txt -c 1 -t comma > aln-pe.mapped.bam.genomeproportion.annotated
Step 14: We have a total of 59 genomes in the reference database. To see how many genomes we recovered (the sum of the per-genome covered proportions), we will use the following one-liner:
[uzi@quince-srv2 ~/oneliners]$ awk -F "," '{sum+=$NF} END{print "Total genomes covered:"sum}' aln-pe.mapped.bam.genomeproportion
Total genomes covered:55.8055
Step 15: Now we will identify those genomes for which the covered proportion is at most 0.99.
[uzi@quince-srv2 ~/oneliners]$ awk -F "," '$NF<=0.99{print $0}' aln-pe.mapped.bam.genomeproportion.annotated
Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence,.69664
Desulfovibrio_desulfuricans,.02709
Desulfovibrio_piger_ATCC_29098,.98358
Dictyoglomus_turgidum_DSM_6724,.98030
Enterococcus_faecalis_V583,.78333
Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586,.67088
Ignicoccus_hospitalis_KIN4/I,.98534
Methanocaldococcus_jannaschii_DSM_2661,.98185
Nanoarchaeum_equitans_Kin4-M,.93661
Rhodospirillum_rubrum,.00181
Sulfolobus_tokodaii,.98943
Thermus_thermophilus,.94016
Treponema_vincentii_ATCC_35580_NZACYH00000000.1,.90457
Step 16: The complete list is as follows:
[uzi@quince-srv2 ~/oneliners]$ cat aln-pe.mapped.bam.genomeproportion.annotated
Acidobacterium_capsulatum_ATCC_51196,.99973
Akkermansia_muciniphila_ATCC_BAA-835,.99998
Archaeoglobus_fulgidus_DSM_4304,.99931
Bacteroides_thetaiotaomicron_VPI-5482,.99989
Bacteroides_vulgatus_ATCC_8482,.99994
Bordetella_bronchiseptica_strain_RB50,.99916
Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence,.69664
Caldicellulosiruptor_bescii,.99646
Caldisaccharolyticus_DSM_8903,.99934
Chlorobiumlimicola_DSM_245,.99879
Chlorobiumphaeobacteroides_DSM_266,.99989
Chlorobiumphaeovibrioides_DSM_265,.99999
Chlorobiumtepidum_TLS,.99987
Chloroflexus_aurantiacus_J-10-fl,.99817
Clostridium_thermocellum_ATCC_27405,.99963
Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence,.99601
Desulfovibrio_desulfuricans,.02709
Desulfovibrio_piger_ATCC_29098,.98358
Dickeya_dadantii,.99943
Dictyoglomus_turgidum_DSM_6724,.98030
Enterococcus_faecalis_V583,.78333
Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586,.67088
Gemmatimonas_aurantiaca_T-27_DNA,.99981
Herpetosiphon_aurantiacus_ATCC_23779,.99980
Hydrogenobaculum_sp._Y04AAS1,.99961
Ignicoccus_hospitalis_KIN4/I,.98534
Leptothrix_cholodnii_SP-6,.99842
Methanocaldococcus_jannaschii_DSM_2661,.98185
Methanococcus_maripaludis_C5,.99399
Methanococcus_maripaludis_strain_S2,_complete_sequence,.99366
Nanoarchaeum_equitans_Kin4-M,.93661
Nitrosomonas_europaea_ATCC_19718,.99529
Nostoc_sp._PCC_7120_DNA,.99938
Pelodictyon_phaeoclathratiforme_BU-1,.99991
Persephonella_marina_EX-H1,.99941
Porphyromonas_gingivalis_ATCC_33277_DNA,.99990
Pyrobaculum_aerophilum_str._IM2,.99851
Pyrobaculum_calidifontis_JCM_11548,.99443
Pyrococcus_horikoshii_OT3_DNA,.99977
Rhodopirellula_baltica_SH_1_complete_genome,.99993
Rhodospirillum_rubrum,.00181
Ruegeria_pomeroyi_DSS-3,.99925
Salinispora_arenicola_CNS-205,.99594
Salinispora_tropica_CNB-440,.99705
Shewanella_baltica_OS185,.99998
Shewanella_baltica_OS223,,.99998
Sulfitobacter_NAS-14.1_scf_1099451320477_,.99697
Sulfitobacter_sp._EE-36_scf_1099451318008_,.99921
Sulfolobus_tokodaii,.98943
Sulfurihydrogenibium_yellowstonense_SS-5,.99087
SulfuriYO3AOP1,.99469
Thermoanaerobacter_pseudethanolicus_ATCC_33223,.99945
Thermotoga_neapolitana_DSM_4359,.99998
Thermotoga_petrophila_RKU-1,.99997
Thermotoga_sp._RQ2,1.00000
Thermus_thermophilus,.94016
Treponema_denticola_ATCC_35405,.99523
Treponema_vincentii_ATCC_35580_NZACYH00000000.1,.90457
It is assumed that BWA, Samtools, Picard Tools, the Integrative Genomics Viewer (IGV), GATK, and VCFtools are installed and available on the system $PATH.
I have a sample from Clostridium difficile 630 from which I will summarize variants (SNPs).
Index the reference genome downloaded from the above link:
[uzi@quince-srv2 ~/oneliners/SNPs]$ bwa index ref.fa
Generate a SAM file by aligning paired-end reads against the reference database
[uzi@quince-srv2 ~/oneliners/SNPs]$ bwa mem ref.fa forward.fastq reverse.fastq > aln-pe.sam
Convert SAM to BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools view -h -b -S aln-pe.sam > aln-pe.bam
Sort BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools sort -m 1000000000 aln-pe.bam aln-pe.sorted
Now we will check alignment statistics using Picard tools
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which CollectAlignmentSummaryMetrics.jar) INPUT=aln-pe.sorted.bam OUTPUT=aln-pe.sorted.alignment_stats.txt REFERENCE_SEQUENCE=ref.fa
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vi -e "^#" -e "^$" aln-pe.sorted.alignment_stats.txt
CATEGORY TOTAL_READS PF_READS PCT_PF_READS PF_NOISE_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_HQ_ALIGNED_READS PF_HQ_ALIGNED_BASES PF_HQ_ALIGNED_Q20_BASES PF_HQ_MEDIAN_MISMATCHES PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS BAD_CYCLES STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER SAMPLE LIBRARY READ_GROUP
FIRST_OF_PAIR 333371 333371 1 0 313569 0.940601 90716605 301550 88367790 87254618 2 0.0105 0.010296 0.000251 298.542723 311278 0.992694 0.501912 0.012613 0.000183
SECOND_OF_PAIR 333265 333265 1 0 312663 0.938181 89921348 300915 87620763 84183606 2 0.013972 0.013701 0.000256 298.30708 311185 0.995273 0 0.500162 0.012173 0.000159
PAIR 666636 666636 1 0 626232 0.939391 180637953 602465 175988553 171438224 2 0.012228 0.011991 0.000253 298.42492 622463 0.993981 0.501038 0.012393 0.000171
The detailed description of these summary metrics is given here. From this link, the following metrics are of interest to us:
PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence
PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads
PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vi -e "^#" -e "^$" aln-pe.sorted.alignment_stats.txt | perl -alne 'print join("\t",@F[0,12,13,14])'
CATEGORY PF_MISMATCH_RATE PF_HQ_ERROR_RATE PF_INDEL_RATE
FIRST_OF_PAIR 0.0105 0.010296 0.000251
SECOND_OF_PAIR 0.013972 0.013701 0.000256
PAIR 0.012228 0.011991 0.000253
As can be seen, the error rates are quite low and we can proceed with the analysis.
Extract only those sequences that were mapped against the reference database. Use -F 4 switch:
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools view -b -F 4 aln-pe.bam > aln-pe.mapped.bam
Sort mapped reads BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools sort -m 1000000000 aln-pe.mapped.bam aln-pe.mapped.sorted
Now, we will assess the quality of mapped reads by generating charts for GC Bias, Mean Quality by Cycle, and Quality Score Distribution
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which CollectGcBiasMetrics.jar) R=ref.fa I=aln-pe.mapped.sorted.bam O=aln-pe.mapped.sorted_GCBias.txt CHART=aln-pe.mapped.sorted_GCBias.pdf ASSUME_SORTED=true
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which MeanQualityByCycle.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.mqc.txt CHART_OUTPUT=aln-pe.mapped.sorted.mqc.pdf
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which QualityScoreDistribution.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.qsd.txt CHART_OUTPUT=aln-pe.mapped.sorted.qsd.pdf
We will use two approaches to call SNPs
Calling SNPs with samtools mpileup
To generate a BCF file (a binary format that holds information about sequence variants such as SNPs), run the following:
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools mpileup -uD -f ref.fa aln-pe.mapped.sorted.bam | bcftools view -bvcg - > aln-pe.mapped.sorted.raw.bcf
where -u tells samtools to output an uncompressed BCF file, -D tells it to keep the read depth for each sample, and -f specifies the reference genome file. Samtools calculates the genotype likelihoods, and bcftools does the SNP calling based on those likelihoods.
The -b flag in bcftools tells it to output to BCF format instead of VCF format, -c tells it to do SNP calling, -v tells it to only output potential variant sites, and -g tells it to call genotypes for each sample in addition to just calling SNPs.
Now run
[uzi@quince-srv2 ~/oneliners/SNPs]$ bcftools view aln-pe.mapped.sorted.raw.bcf | vcfutils.pl varFilter -D100 > aln-pe.mapped.sorted.vcf
This line converts the binary BCF file into a text VCF file. We then pipe it into vcfutils.pl with the varFilter -D100 option, which filters out SNPs with read depth higher than 100 (sites with coverage above this tend to represent variation between variable-copy-number repeats). Depending on the coverage in your dataset, you can change this number, e.g. to -D500.
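For intuition only, the depth cutoff by itself can be sketched as an awk filter on the DP tag in the INFO column; this mimics just the -D part of varFilter, not its other checks, and the records below are toy data:

```shell
# Keep VCF header lines and sites whose INFO DP value is <= 100.
printf '##fileformat=VCFv4.1\nc1\t10\t.\tC\tT\t50\t.\tDP=60;MQ=60\nc1\t20\t.\tG\tA\t50\t.\tDP=250;MQ=60\n' |
awk -F"\t" '/^#/{print; next}
  {n=split($8,a,";"); for(i=1;i<=n;i++) if(a[i]~/^DP=/ && substr(a[i],4)+0<=100){print; break}}'
# ##fileformat=VCFv4.1
# c1	10	.	C	T	50	.	DP=60;MQ=60
```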
Samtools detects SNPs by checking the number of different reads that share a mismatch from the reference, the sequence quality data, and the expected sequencing error rate at a given site. We now display the first few records of the VCF file. You can get more information on the format here.
[uzi@quince-srv2 ~/oneliners/SNPs]$ head -50 aln-pe.mapped.sorted.vcf
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT aln-pe.mapped.sorted.bam
Cdiff630 1047 . C T 216 . DP=60;VDB=0.0135;AF1=1;AC1=2;DP4=0,0,33,25;MQ=60;FQ=-202 GT:PL:DP:GQ 1/1:249,175,0:58:99
Cdiff630 1960 . C T 177 . DP=64;VDB=0.0137;AF1=1;AC1=2;DP4=0,0,27,37;MQ=60;FQ=-220 GT:PL:DP:GQ 1/1:210,193,0:64:99
Cdiff630 2107 . G A 216 . DP=58;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,29,28;MQ=60;FQ=-199 GT:PL:DP:GQ 1/1:249,172,0:57:99
Cdiff630 2206 . A G 205 . DP=55;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,23,30;MQ=60;FQ=-187 GT:PL:DP:GQ 1/1:238,160,0:53:99
Cdiff630 2212 . T C 222 . DP=56;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,24,31;MQ=60;FQ=-193 GT:PL:DP:GQ 1/1:255,166,0:55:99
Cdiff630 2332 . T C 222 . DP=50;VDB=0.0131;AF1=1;AC1=2;DP4=0,0,23,25;MQ=60;FQ=-171 GT:PL:DP:GQ 1/1:255,144,0:48:99
Cdiff630 2443 . T C 222 . DP=59;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,32,26;MQ=60;FQ=-202 GT:PL:DP:GQ 1/1:255,175,0:58:99
Cdiff630 2979 . G A 159 . DP=37;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,20,15;MQ=60;FQ=-132 GT:PL:DP:GQ 1/1:192,105,0:35:99
Cdiff630 3235 . T C 222 . DP=37;VDB=0.0132;AF1=1;AC1=2;DP4=0,0,21,14;MQ=60;FQ=-132 GT:PL:DP:GQ 1/1:255,105,0:35:99
Cdiff630 3304 . A G 55.1 . DP=38;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,5,2;MQ=60;FQ=-48 GT:PL:DP:GQ 1/1:88,21,0:7:39
Cdiff630 3385 . G A 169 . DP=46;VDB=0.0121;AF1=1;AC1=2;DP4=0,0,22,23;MQ=60;FQ=-162 GT:PL:DP:GQ 1/1:202,135,0:45:99
Cdiff630 5602 . A G 149 . DP=45;VDB=0.0105;AF1=1;AC1=2;DP4=0,1,19,25;MQ=60;FQ=-144;PV4=1,1,1,1 GT:PL:DP:GQ 1/1:182,117,0:45:99
Cdiff630 6034 . C T,G 203 . DP=70;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,37,31;MQ=60;FQ=-226 GT:PL:DP:GQ 1/1:236,199,0,239,168,233:68:99
Cdiff630 6310 . C T 217 . DP=62;VDB=0.0113;AF1=1;AC1=2;DP4=0,0,27,32;MQ=60;FQ=-205 GT:PL:DP:GQ 1/1:250,178,0:59:99
Cdiff630 6812 . G A 222 . DP=60;VDB=0.0135;AF1=1;AC1=2;DP4=0,0,32,26;MQ=60;FQ=-202 GT:PL:DP:GQ 1/1:255,175,0:58:99
Cdiff630 7281 . C A 222 . DP=43;VDB=0.0126;AF1=1;AC1=2;DP4=0,0,22,21;MQ=60;FQ=-156 GT:PL:DP:GQ 1/1:255,129,0:43:99
Cdiff630 7427 . G A 195 . DP=41;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,18,19;MQ=60;FQ=-138 GT:PL:DP:GQ 1/1:228,111,0:37:99
Cdiff630 7467 . G A 162 . DP=39;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,19,19;MQ=60;FQ=-141 GT:PL:DP:GQ 1/1:195,114,0:38:99
Cdiff630 7976 . A T 222 . DP=50;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,26,24;MQ=60;FQ=-178 GT:PL:DP:GQ 1/1:255,151,0:50:99
Cdiff630 8171 . T C 222 . DP=42;VDB=0.0135;AF1=1;AC1=2;DP4=1,0,18,21;MQ=60;FQ=-126;PV4=0.48,1,1,1 GT:PL:DP:GQ 1/1:255,99,0:40:99
Cdiff630 8698 . T C 215 . DP=49;VDB=0.0129;AF1=1;AC1=2;DP4=0,0,27,21;MQ=60;FQ=-171 GT:PL:DP:GQ 1/1:248,144,0:48:99
Cdiff630 8770 . T C 222 . DP=48;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,26,21;MQ=60;FQ=-168 GT:PL:DP:GQ 1/1:255,141,0:47:99
Cdiff630 9394 . G A 153 . DP=54;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,21,31;MQ=60;FQ=-184 GT:PL:DP:GQ 1/1:186,157,0:52:99
Calling SNPs with GATK's Unified Genotyper
The GATK developers recommend local re-alignment around indels as a pre-processing step, because reads whose ends map to the location of an indel can sometimes lead to false-positive SNP calls.
In GATK, we can do local re-alignment around potential indel sites, incorporating information from all the reads in that region simultaneously rather than mapping each read individually.
To use GATK, we have to generate a dictionary file from the reference and fix our BAM file by adding read-group information; otherwise GATK will throw a "SAM/BAM file is malformed" error. We also need to index the resulting file. These steps are as follows:
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which CreateSequenceDictionary.jar) R=ref.fa O=ref.dict
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which AddOrReplaceReadGroups.jar) I=aln-pe.mapped.sorted.bam O=aln-pe.mapped.sorted.grps.bam LB=whatever PL=illumina PU=whatever SM=whatever
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools index aln-pe.mapped.sorted.grps.bam
First, GATK needs to figure out which regions are in need of re-alignment:
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which GenomeAnalysisTK.jar) -T RealignerTargetCreator -R ref.fa -I aln-pe.mapped.sorted.grps.bam -o aln-pe.mapped.sorted.grps.realign.intervals
And then it needs to actually do the re-alignment
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx4g -jar $(which GenomeAnalysisTK.jar) -I aln-pe.mapped.sorted.grps.bam -R ref.fa -T IndelRealigner -targetIntervals aln-pe.mapped.sorted.grps.realign.intervals -o aln-pe.mapped.sorted.grps.realigned.bam
We again fix the resulting BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which AddOrReplaceReadGroups.jar) I=aln-pe.mapped.sorted.grps.realigned.bam O=aln-pe.mapped.sorted.grps.realigned.grps.bam LB=whatever PL=illumina PU=whatever SM=whatever
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools index aln-pe.mapped.sorted.grps.realigned.grps.bam
We now do the SNP calling on the resultant file
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which GenomeAnalysisTK.jar) -R ref.fa -T UnifiedGenotyper -I aln-pe.mapped.sorted.grps.realigned.grps.bam -o aln-pe.mapped.sorted.grps.realigned.grps.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -rf BadCigar
The parameters are slightly different from those used in samtools mpileup. -stand_emit_conf 10.0 means don't report any potential SNPs with a quality below 10.0; variants with a quality between 10.0 and the threshold set by -stand_call_conf (50.0, in this case) will be emitted but listed as failing the quality filter. -dcov 500 means that at any site with more than 500x coverage, the genotype caller will only use 500 randomly selected reads, for computational efficiency.
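As a quick check on these thresholds, emitted variants can be split by QUAL (column 6 of a standard VCF) with awk. In practice you would point it at aln-pe.mapped.sorted.grps.realigned.grps.vcf; the toy two-record VCF below just makes the behaviour clear:

```shell
# Variants with QUAL >= 50 are confident calls; those between 10 and 50 are
# emitted but would be flagged LowQual. Toy two-record VCF piped in:
printf '#CHROM POS ID REF ALT QUAL FILTER INFO\nchr1 100 . A G 60.5 . DP=10\nchr1 200 . C T 20.1 . DP=5\n' |
awk '!/^#/ { if ($6 + 0 >= 50) pass++; else low++ }
     END { print pass + 0, "pass;", low + 0, "LowQual" }'
# → 1 pass; 1 LowQual
```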
Now, we check the first few records of the generated VCF file:
[uzi@quince-srv2 ~/oneliners/SNPs]$ head -50 aln-pe.mapped.sorted.grps.realigned.grps.vcf
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.1-1-g07a4bf8,Date="Mon Apr 07 01:40:13 BST 2014",Epoch=1396831213225,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[aln-pe.mapped.sorted.grps.realigned.grps.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[BadCigar] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=ref.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false output_mode=EMIT_VARIANTS_ONLY heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub onlyEmitSamples=[] debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=Cdiff630,length=4290252>
##reference=file:///home/uzi/oneliners/SNPs/ref.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT whatever
Cdiff630 1047 . C T 1896.77 . AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.61 GT:AD:DP:GQ:PL 1/1:0,60:60:99:1925,141,0
Cdiff630 1960 . C T 2161.77 . AC=2;AF=1.00;AN=2;DP=64;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78 GT:AD:DP:GQ:PL 1/1:0,64:64:99:2190,163,0
Cdiff630 2107 . G A 1907.77 . AC=2;AF=1.00;AN=2;DP=58;Dels=0.00;FS=0.000;HaplotypeScore=3.5489;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89 GT:AD:DP:GQ:PL 1/1:0,58:58:99:1936,144,0
Cdiff630 2206 . A G 1826.77 . AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.21 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1855,138,0
Cdiff630 2212 . T C 1841.77 . AC=2;AF=1.00;AN=2;DP=56;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89 GT:AD:DP:GQ:PL 1/1:0,56:56:99:1870,138,0
Cdiff630 2332 . T C 1632.77 . AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.66 GT:AD:DP:GQ:PL 1/1:0,50:50:99:1661,123,0
Cdiff630 2443 . T C 2011.77 . AC=2;AF=1.00;AN=2;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=3.4709;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.10 GT:AD:DP:GQ:PL 1/1:0,59:59:99:2040,150,0
Cdiff630 2979 . G A 1155.77 . AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.24 GT:AD:DP:GQ:PL 1/1:0,37:37:87:1184,87,0
Cdiff630 3235 . T C 1249.77 . AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78 GT:AD:DP:GQ:PL 1/1:0,37:37:96:1278,96,0
Cdiff630 3304 . A G 1269.77 . AC=2;AF=1.00;AN=2;DP=38;Dels=0.00;FS=0.000;HaplotypeScore=0.9469;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.41 GT:AD:DP:GQ:PL 1/1:0,38:38:96:1298,96,0
Cdiff630 3385 . G A 1400.77 . AC=2;AF=1.00;AN=2;DP=46;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.45 GT:AD:DP:GQ:PL 1/1:0,46:46:99:1429,105,0
Cdiff630 5602 . A G 1362.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.270;DP=45;Dels=0.00;FS=0.000;HaplotypeScore=0.9992;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.115;QD=30.28;ReadPosRankSum=1.578 GT:AD:DP:GQ:PL 1/1:1,44:45:99:1391,104,0
Cdiff630 6034 . C T 1983.77 . AC=2;AF=1.00;AN=2;DP=70;Dels=0.00;FS=0.000;HaplotypeScore=1.8317;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.34 GT:AD:DP:GQ:PL 1/1:0,68:70:99:2012,150,0
Cdiff630 6310 . C T 2068.77 . AC=2;AF=1.00;AN=2;DP=62;Dels=0.00;FS=0.000;HaplotypeScore=1.7882;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.37 GT:AD:DP:GQ:PL 1/1:0,62:62:99:2097,159,0
Cdiff630 6812 . G A 1894.77 . AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.58 GT:AD:DP:GQ:PL 1/1:0,60:60:99:1923,141,0
Cdiff630 7281 . C A 1343.77 . AC=2;AF=1.00;AN=2;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.25 GT:AD:DP:GQ:PL 1/1:0,43:43:99:1372,99,0
Cdiff630 7427 . G A 1173.77 . AC=2;AF=1.00;AN=2;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.63 GT:AD:DP:GQ:PL 1/1:0,41:41:87:1202,87,0
Cdiff630 7467 . G A 1138.77 . AC=2;AF=1.00;AN=2;DP=39;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.20 GT:AD:DP:GQ:PL 1/1:0,39:39:84:1167,84,0
Cdiff630 7976 . A T 1629.77 . AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.60 GT:AD:DP:GQ:PL 1/1:0,50:50:99:1658,123,0
We can check how many SNPs are returned by both approaches:
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" -e "INDEL" aln-pe.mapped.sorted.grps.realigned.grps.vcf
36365
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" -e "INDEL" aln-pe.mapped.sorted.vcf
31413
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" aln-pe.mapped.sorted.vcf
32093
We can use VCFtools, which is very useful for manipulating and comparing VCF files. For example, the counting step above could be done with VCFtools as:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.vcf
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf aln-pe.mapped.sorted.vcf
After filtering, kept 1 out of 1 Individuals
After filtering, kept 32093 out of a possible 32093 Sites
Run Time = 0.00 seconds
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
After filtering, kept 1 out of 1 Individuals
After filtering, kept 36365 out of a possible 36365 Sites
Run Time = 0.00 seconds
Our genome is 4290252 bp long. We can even see how many SNPs there are in the first 10000 bp:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --chr Cdiff630 --from-bp 1 --to-bp 10000
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
--chr Cdiff630
--to-bp 10000
--from-bp 1
After filtering, kept 1 out of 1 Individuals
After filtering, kept 23 out of a possible 36365 Sites
Run Time = 0.00 seconds
VCFtools can perform analyses on the variants that pass the filters, or simply write those variants out to a new file; this lets us create subsets of VCF files or remove unwanted variants. To write out the variants that pass the filters, use the --recode option; in addition, use --recode-INFO-all to include all data from the INFO fields in the output.
The following will generate a file out.recode.vcf in the current folder:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --chr Cdiff630 --from-bp 1 --to-bp 10000 --recode --recode-INFO-all
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vi "^#" out.recode.vcf
Cdiff630 1047 . C T 1896.77 . AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.61 GT:AD:DP:GQ:PL 1/1:0,60:60:99:1925,141,0
Cdiff630 1960 . C T 2161.77 . AC=2;AF=1.00;AN=2;DP=64;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78 GT:AD:DP:GQ:PL 1/1:0,64:64:99:2190,163,0
Cdiff630 2107 . G A 1907.77 . AC=2;AF=1.00;AN=2;DP=58;Dels=0.00;FS=0.000;HaplotypeScore=3.5489;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89 GT:AD:DP:GQ:PL 1/1:0,58:58:99:1936,144,0
Cdiff630 2206 . A G 1826.77 . AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.21 GT:AD:DP:GQ:PL 1/1:0,55:55:99:1855,138,0
Cdiff630 2212 . T C 1841.77 . AC=2;AF=1.00;AN=2;DP=56;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89 GT:AD:DP:GQ:PL 1/1:0,56:56:99:1870,138,0
Cdiff630 2332 . T C 1632.77 . AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.66 GT:AD:DP:GQ:PL 1/1:0,50:50:99:1661,123,0
Cdiff630 2443 . T C 2011.77 . AC=2;AF=1.00;AN=2;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=3.4709;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.10 GT:AD:DP:GQ:PL 1/1:0,59:59:99:2040,150,0
Cdiff630 2979 . G A 1155.77 . AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.24 GT:AD:DP:GQ:PL 1/1:0,37:37:87:1184,87,0
Cdiff630 3235 . T C 1249.77 . AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78 GT:AD:DP:GQ:PL 1/1:0,37:37:96:1278,96,0
Cdiff630 3304 . A G 1269.77 . AC=2;AF=1.00;AN=2;DP=38;Dels=0.00;FS=0.000;HaplotypeScore=0.9469;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.41 GT:AD:DP:GQ:PL 1/1:0,38:38:96:1298,96,0
Cdiff630 3385 . G A 1400.77 . AC=2;AF=1.00;AN=2;DP=46;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.45 GT:AD:DP:GQ:PL 1/1:0,46:46:99:1429,105,0
Cdiff630 5602 . A G 1362.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.270;DP=45;Dels=0.00;FS=0.000;HaplotypeScore=0.9992;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.115;QD=30.28;ReadPosRankSum=1.578 GT:AD:DP:GQ:PL 1/1:1,44:45:99:1391,104,0
Cdiff630 6034 . C T 1983.77 . AC=2;AF=1.00;AN=2;DP=70;Dels=0.00;FS=0.000;HaplotypeScore=1.8317;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.34 GT:AD:DP:GQ:PL 1/1:0,68:70:99:2012,150,0
Cdiff630 6310 . C T 2068.77 . AC=2;AF=1.00;AN=2;DP=62;Dels=0.00;FS=0.000;HaplotypeScore=1.7882;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.37 GT:AD:DP:GQ:PL 1/1:0,62:62:99:2097,159,0
Cdiff630 6812 . G A 1894.77 . AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.58 GT:AD:DP:GQ:PL 1/1:0,60:60:99:1923,141,0
Cdiff630 7281 . C A 1343.77 . AC=2;AF=1.00;AN=2;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.25 GT:AD:DP:GQ:PL 1/1:0,43:43:99:1372,99,0
Cdiff630 7427 . G A 1173.77 . AC=2;AF=1.00;AN=2;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.63 GT:AD:DP:GQ:PL 1/1:0,41:41:87:1202,87,0
Cdiff630 7467 . G A 1138.77 . AC=2;AF=1.00;AN=2;DP=39;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.20 GT:AD:DP:GQ:PL 1/1:0,39:39:84:1167,84,0
Cdiff630 7976 . A T 1629.77 . AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.60 GT:AD:DP:GQ:PL 1/1:0,50:50:99:1658,123,0
Cdiff630 8171 . T C 1353.77 . AC=2;AF=1.00;AN=2;BaseQRankSum=1.568;DP=42;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-1.155;QD=32.23;ReadPosRankSum=0.908 GT:AD:DP:GQ:PL 1/1:1,41:42:83:1382,83,0
Cdiff630 8698 . T C 1662.77 . AC=2;AF=1.00;AN=2;DP=49;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.93 GT:AD:DP:GQ:PL 1/1:0,49:49:99:1691,126,0
Cdiff630 8770 . T C 1672.77 . AC=2;AF=1.00;AN=2;DP=48;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.85 GT:AD:DP:GQ:PL 1/1:0,48:48:99:1701,126,0
Cdiff630 9394 . G A 1741.77 . AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.67 GT:AD:DP:GQ:PL 1/1:0,54:54:99:1770,129,0
We can also determine which sites and individuals are shared between the VCF files generated by the two approaches:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.vcf --diff aln-pe.mapped.sorted.grps.realigned.grps.vcf --out compare
This will generate two files, compare.diff.indv_in_files and compare.diff.sites_in_files. Now check compare.diff.sites_in_files:
[uzi@quince-srv2 ~/oneliners/SNPs]$ head compare.diff.sites_in_files
CHROM POS IN_FILE REF ALT1 ALT2
Cdiff630 1047 B C T T
Cdiff630 1960 B C T T
Cdiff630 2107 B G A A
Cdiff630 2206 B A G G
Cdiff630 2212 B T C C
Cdiff630 2332 B T C C
Cdiff630 2443 B T C C
Cdiff630 2979 B G A A
Cdiff630 3235 B T C C
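The IN_FILE column says where each site was found: 1 for the first VCF only, 2 for the second only, B for both. A one-line awk tally over that column summarises the overlap; a sketch on toy rows (replace the printf with compare.diff.sites_in_files in practice):

```shell
# Count sites per IN_FILE category (3rd column), skipping the header row.
printf 'CHROM POS IN_FILE REF ALT1 ALT2\nc 1 B A G G\nc 2 1 A G .\nc 3 B C T T\n' |
awk 'NR > 1 { tally[$3]++ } END { for (k in tally) print k, tally[k] }' | sort
# → 1 1
# → B 2
```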
We can also summarize the sequencing depth:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --depth -c > depth_summary.txt
[uzi@quince-srv2 ~/oneliners/SNPs]$ cat depth_summary.txt
INDV N_SITES MEAN_DEPTH
whatever 36365 44.2186
To write out site-wise sequence depths only at sites that have no missing data, include the --max-missing argument. This will produce site_depth_summary.ldepth in the current folder
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --site-depth --max-missing 1.0 --out site_depth_summary
VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009
Parameters as interpreted:
--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
--max-missing 1
--out site_depth_summary
--site-depth
After filtering, kept 1 out of 1 Individuals
Outputting Depth for Each Site
After filtering, kept 36365 out of a possible 36365 Sites
Run Time = 0.00 seconds
[uzi@quince-srv2 ~/oneliners/SNPs]$ head site_depth_summary.ldepth
CHROM POS SUM_DEPTH SUMSQ_DEPTH
Cdiff630 1047 60 3600
Cdiff630 1960 64 4096
Cdiff630 2107 58 3364
Cdiff630 2206 55 3025
Cdiff630 2212 56 3136
Cdiff630 2332 50 2500
Cdiff630 2443 59 3481
Cdiff630 2979 37 1369
Cdiff630 3235 37 1369
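The genome-wide mean depth from --depth can be recomputed from this file by averaging the SUM_DEPTH column; on the real site_depth_summary.ldepth this should reproduce the 44.2186 reported above. A sketch on the first two sites:

```shell
# Mean of SUM_DEPTH (3rd column), skipping the header line.
printf 'CHROM POS SUM_DEPTH SUMSQ_DEPTH\nCdiff630 1047 60 3600\nCdiff630 1960 64 4096\n' |
awk 'NR > 1 { sum += $3; n++ } END { printf "%.4f\n", sum / n }'
# → 62.0000
```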
VCFtools can also convert VCF files into other formats, for example, PED and MAP files:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --plink --chr 1 --out output_in_plink
Finally, to view everything together, we can use the Integrative Genomics Viewer (IGV), which will require four files: ref.fa, aln-pe.mapped.sorted.grps.realigned.grps.vcf, aln-pe.mapped.sorted.grps.realigned.grps.bam, and aln-pe.mapped.sorted.grps.realigned.grps.bam.bai.
In the figure above, the red rectangles in the top panel are the variants found by our analysis, and the bottom panel shows the reads aligned to that genome region along with coverage information.
I have a SAM file that I generated by aligning paired-end reads against the reference database:
[uzi@quince-srv2 ~/oneliners]$ bwa mem ref.fasta forward.fastq reverse.fastq > aln-pe.sam
The resulting SAM file has 4806207 records (~4.8 million), and it took 2 minutes and 7.15 seconds to get the linkage information (i.e. reads that map to multiple contigs/genomes) using the following one-liner:
[uzi@quince-srv2 ~/oneliners]$ awk '$2~/^99$|^147$|^83$|^163$|^67$|^131$|^115$|^179$|^81$|^161$|^97$|^145$|^65$|^129$|^113$|^177$/ && $2!~/^SN/{print $1"\t"$3"\t"$7}' aln-pe.sam | sort -n -k1,1 | uniq | awk 'BEGIN { FS="\t" } { c[$1]++; l[$1,c[$1]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }'
HWI-ST1242:77:C13H3ACXX:7:2206:11068:81469 Acidobacterium_capsulatum_ATCC_51196 Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2206:11068:81469 Bordetella_bronchiseptica_strain_RB50 Acidobacterium_capsulatum_ATCC_51196
HWI-ST1242:77:C13H3ACXX:7:1306:17569:74649 Thermotoga_petrophila_RKU-1 Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:1306:17569:74649 Thermotoga_sp._RQ2 Thermotoga_petrophila_RKU-1
HWI-ST1242:77:C13H3ACXX:7:2101:13560:96472 Thermotoga_petrophila_RKU-1 Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:2101:13560:96472 Thermotoga_sp._RQ2 Thermotoga_petrophila_RKU-1
HWI-ST1242:77:C13H3ACXX:7:1209:7458:61056 Treponema_denticola_ATCC_35405 Treponema_vincentii_ATCC_35580_NZACYH00000000.1
HWI-ST1242:77:C13H3ACXX:7:1209:7458:61056 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:1301:15983:83574 Bordetella_bronchiseptica_strain_RB50 Rhodopirellula_baltica_SH_1_complete_genome
HWI-ST1242:77:C13H3ACXX:7:1301:15983:83574 Rhodopirellula_baltica_SH_1_complete_genome Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1103:12322:58936 Bordetella_bronchiseptica_strain_RB50 Desulfovibrio_piger_ATCC_29098
HWI-ST1242:77:C13H3ACXX:7:1103:12322:58936 Desulfovibrio_piger_ATCC_29098 Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1210:12644:24976 Bacteroides_thetaiotaomicron_VPI-5482 Thermoanaerobacter_pseudethanolicus_ATCC_33223
HWI-ST1242:77:C13H3ACXX:7:1210:12644:24976 Thermoanaerobacter_pseudethanolicus_ATCC_33223 Bacteroides_thetaiotaomicron_VPI-5482
HWI-ST1242:77:C13H3ACXX:7:2108:8801:73269 Sulfurihydrogenibium_yellowstonense_SS-5 SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2108:8801:73269 SulfuriYO3AOP1 Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2108:7167:90955 Salinispora_arenicola_CNS-205 Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2108:7167:90955 Shewanella_baltica_OS223, Salinispora_arenicola_CNS-205
HWI-ST1242:77:C13H3ACXX:7:2303:16257:47096 Shewanella_baltica_OS185 Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2303:16257:47096 Shewanella_baltica_OS223, Shewanella_baltica_OS185
HWI-ST1242:77:C13H3ACXX:7:1305:15998:77806 Sulfurihydrogenibium_yellowstonense_SS-5 SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:1305:15998:77806 SulfuriYO3AOP1 Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:1215:18228:5604 Rhodopirellula_baltica_SH_1_complete_genome Salinispora_arenicola_CNS-205
HWI-ST1242:77:C13H3ACXX:7:1215:18228:5604 Salinispora_arenicola_CNS-205 Rhodopirellula_baltica_SH_1_complete_genome
HWI-ST1242:77:C13H3ACXX:7:2109:5989:11352 Bordetella_bronchiseptica_strain_RB50 Leptothrix_cholodnii_SP-6
HWI-ST1242:77:C13H3ACXX:7:2109:5989:11352 Leptothrix_cholodnii_SP-6 Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 =
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:2202:20027:26498 Treponema_denticola_ATCC_35405 Treponema_vincentii_ATCC_35580_NZACYH00000000.1
HWI-ST1242:77:C13H3ACXX:7:2202:20027:26498 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:2211:7978:77793 Akkermansia_muciniphila_ATCC_BAA-835 Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2211:7978:77793 Bordetella_bronchiseptica_strain_RB50 Akkermansia_muciniphila_ATCC_BAA-835
HWI-ST1242:77:C13H3ACXX:7:2201:12614:25584 Thermotoga_petrophila_RKU-1 Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:2201:12614:25584 Thermotoga_sp._RQ2 Thermotoga_petrophila_RKU-1
...
...
where the 1st column is the read identifier (read identifiers repeat on subsequent rows and may have a multiplicity of 2 or more), and the 2nd and 3rd columns give the linkage information. For example, let's grep the records from the SAM file using one of the identifiers given above:
[uzi@quince-srv2 ~/oneliners]$ grep -i "HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746" *.sam
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746 65 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 1506973 41 77M24S Treponema_denticola_ATCC_35405 57994 0 AAAATGTACATCCCTGTACATTTTTTACCTACGTGTTTTCAAGGCAAACACACTACTGATTATAACCACCCCCATCCCCCCCCCAAAACATACATCCCTGT @CCFFFDDHHHHHJJJJJJJJJJJJJJJIJIJJJD11?0*00*0)0?G;D################################################### NM:i:4 AS:i:57 XS:i:33
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746 129 Treponema_denticola_ATCC_35405 57994 30 39S56M6S Treponema_vincentii_ATCC_35580_NZACYH00000000.1 1506973 0 GGCAAAATCAGGTAATCCTTTTGAAATATGCGGGGCATCTACTGCGTCGTTACGCAAAAGCGAATGCTCAACGTACAGCAAGTACGCCTCCGCTTCGCTTT BCCFFFFFHHHHFHHIIIJJJJJIJJJIJJJJJJJGIGIIGIGIEE;BDDAACA@?BBD>9=8@5?::>AC?588A243(934>C79@89A9309559588 NM:i:3 AS:i:41 XS:i:27 XP:Z:Treponema_vincentii_ATCC_35580_NZACYH00000000.1,-1507188,61S40M,30,0;
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746 145 Treponema_vincentii_ATCC_35580_NZACYH00000000.1 1507188 30 61S40M = 1506973 -255 AAAGCGAAGCGGAGGCGTACTTGCTGTACGTTGAGCATTCGCTTTTGCGTAACGACGCAGTAGATGCCCCGCATATTTCAAAAGGATTACCTGATTTTGCC 8859559039A98@97C>439(342A885?CA>::?5@8=9>DBB?@ACAADDB;EEIGIGIIGIGJJJJJJJIJJJIJJJJJIIIHHFHHHHFFFFFCCB NM:i:0 AS:i:40 XS:i:29 XP:Z:Treponema_denticola_ATCC_35405,+57994,39S56M6S,30,3;
Here, a read is paired-end read 1 if its flag is a member of the set (99,83,67,115,81,97,65,113) and paired-end read 2 if its flag is a member of the set (147,163,131,179,161,145,129,177). So the first record returned by the grep statement is paired-end read 1 and the next two records are paired-end read 2. You can also drop the redundant records where column $7 is "=" and run the query as follows (it ran in 7.7 seconds):
[uzi@quince-srv2 ~/oneliners]$ awk '$2~/^99$|^147$|^83$|^163$|^67$|^131$|^115$|^179$|^81$|^161$|^97$|^145$|^65$|^129$|^113$|^177$/ && $2!~/^SN/ && $7!~/=/{print $1"\t"$3"\t"$7}' aln-pe.sam | sort -n -k1,1 | uniq | awk 'BEGIN { FS="\t" } { c[$1]++; l[$1,c[$1]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }'
HWI-ST1242:77:C13H3ACXX:7:2207:10952:95139 Persephonella_marina_EX-H1 Thermoanaerobacter_pseudethanolicus_ATCC_33223
HWI-ST1242:77:C13H3ACXX:7:2207:10952:95139 Thermoanaerobacter_pseudethanolicus_ATCC_33223 Persephonella_marina_EX-H1
HWI-ST1242:77:C13H3ACXX:7:2216:3177:14124 Shewanella_baltica_OS185 Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2216:3177:14124 Shewanella_baltica_OS223, Shewanella_baltica_OS185
HWI-ST1242:77:C13H3ACXX:7:2205:11785:35383 Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence Leptothrix_cholodnii_SP-6
HWI-ST1242:77:C13H3ACXX:7:2205:11785:35383 Leptothrix_cholodnii_SP-6 Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence
HWI-ST1242:77:C13H3ACXX:7:2104:19027:61363 Sulfurihydrogenibium_yellowstonense_SS-5 SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2104:19027:61363 SulfuriYO3AOP1 Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:1301:5017:76112 Sulfurihydrogenibium_yellowstonense_SS-5 SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:1301:5017:76112 SulfuriYO3AOP1 Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2316:4885:64809 Sulfurihydrogenibium_yellowstonense_SS-5 SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2316:4885:64809 SulfuriYO3AOP1 Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2305:6004:6780 Acidobacterium_capsulatum_ATCC_51196 Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2305:6004:6780 Bordetella_bronchiseptica_strain_RB50 Acidobacterium_capsulatum_ATCC_51196
HWI-ST1242:77:C13H3ACXX:7:2302:12911:73962 Gemmatimonas_aurantiaca_T-27_DNA gi|83591340|ref|NC_007643.1|
HWI-ST1242:77:C13H3ACXX:7:2302:12911:73962 gi|83591340|ref|NC_007643.1| Gemmatimonas_aurantiaca_T-27_DNA
HWI-ST1242:77:C13H3ACXX:7:1205:3889:3509 Thermotoga_neapolitana_DSM_4359 Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:1205:3889:3509 Thermotoga_sp._RQ2 Thermotoga_neapolitana_DSM_4359
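The final grouping stage of the one-liner keeps only read identifiers that occur on more than one line. It can be sanity-checked on toy input (hypothetical read IDs and reference names):

```shell
# Keep only read IDs occurring on more than one line, i.e. reads linking two
# references; this is the same grouping logic as the final awk stage above.
printf 'r1\tgenomeA\tgenomeB\nr1\tgenomeB\tgenomeA\nr2\tgenomeC\tgenomeC\n' |
awk 'BEGIN { FS = "\t" } { c[$1]++; l[$1, c[$1]] = $0 }
     END { for (i in c) if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i, j] }'
```

Only the two r1 lines survive; the single r2 line is dropped.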
The purpose of the final awk statement is to remove singletons, i.e. to get rid of reads where both paired ends map to the same genome/contig. Refer to the following table on how I have used the flags:
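Those flag sets can also be tested bitwise instead of enumerated: in the SAM specification, bit 64 (0x40) marks the first read in a pair and bit 128 (0x80) marks the second. A portable-awk sketch using plain integer arithmetic (POSIX awk has no bitwise operators):

```shell
# Classify SAM FLAG values: bit 64 => first in pair, bit 128 => second in pair.
echo "99 147 65 129 83 163" |
awk '{ for (i = 1; i <= NF; i++)
         print $i, int($i / 64) % 2 ? "read1" : (int($i / 128) % 2 ? "read2" : "unpaired") }'
# → 99 read1, 147 read2, 65 read1, 129 read2, 83 read1, 163 read2 (one per line)
```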
I assembled Nanoarchaeum equitans, for which there were 1639910 trimmed forward and reverse reads (trimming was done using Sickle version 1.200 with a minimum window quality score of 20, and reads shorter than 10 bp after trimming were removed). This dataset was sequenced on a MiSeq machine using a Nextera library.
I used all 1639910 reads in VelvetOptimiser with the n50 score as the optimisation criterion and a kmer range of [21,99]. In the final assembly, the optimum kmer size was 99, and I obtained 13795 contigs with an n50 score of 201, which was NOT ACCEPTABLE. Neil Hall suggested that the problem is perhaps too much coverage, which propagates errors because of the depth of assignments, and that subsampling may be the way to go.
So I subsampled the reads, taking 10% (163991), 20% (327982), 30% (491973), 40% (655964), 50% (819955), 60% (983946), and 70% (1147937) of the reads respectively, using reservoir sampling.
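The reservoir-sampling step keeps the first k records and then replaces a random slot with decreasing probability, so every record ends up in the sample with probability k/n in a single pass. A minimal sketch on plain numbered lines rather than paired FASTQ records:

```shell
# Draw a uniform sample of k=10 lines from a 100-line stream in one pass;
# the output always contains exactly 10 lines.
seq 1 100 |
awk -v k=10 'BEGIN { srand() }
             { s = x++ < k ? x - 1 : int(rand() * x); if (s < k) R[s] = $0 }
             END { for (i in R) print R[i] }'
```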
Subsampling 10% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=163991 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p10.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p10.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p10.fastq NE_R2_p10.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 20% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=327982 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p20.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p20.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p20.fastq NE_R2_p20.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 30% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=491973 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p30.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p30.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p30.fastq NE_R2_p30.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 40% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=655964 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p40.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p40.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p40.fastq NE_R2_p40.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 50% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=819955 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p50.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p50.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p50.fastq NE_R2_p50.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 60% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=983946 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p60.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p60.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p60.fastq NE_R2_p60.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 70% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=1147937 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p70.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p70.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p70.fastq NE_R2_p70.fastq' -t 10 --optFuncKmer 'n50' &
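The subsampling stage in each command above is a single-pass reservoir sample: the second awk keeps a buffer R of k rows, filling it from the first k rows and thereafter replacing a random slot with decreasing probability, so every row ends up in the sample with equal chance. A minimal sketch of the same idiom on toy input (plain numbered lines instead of interleaved FASTQ records, and a fixed seed for reproducibility):

```shell
# Reservoir-sample k=5 lines uniformly from 100 input lines,
# then count how many lines came out (always min(k, n) = 5 here).
seq 100 | awk -v k=5 'BEGIN{srand(42);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | wc -l
```

Because the buffer always holds exactly k rows once the input exceeds k lines, the output size is deterministic even though the selected rows are random.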
The results are given below:
% of reads | Optimum kmer size | No of contigs | n50 score | Length of longest contig | Number of contigs > 1k
100%       | 99                | 13795         | 201       | 1950                     | 28
70%        | 99                | 2386          | 1564      | 288878                   | 18
60%        | 99                | 611           | 396479    | 396479                   | 11
50%        | 95                | 151           | 489315    | 489315                   | 14
40%        | 95                | 265           | 451202    | 451202                   | 16
30%        | 95                | 661           | 457328    | 457328                   | 10
20%        | 99                | 330           | 101099    | 131609                   | 16
10%        | 95                | 1277          | 2263      | 13314                    | 121
To validate the assembly, we used the paper Read Length and Repeat Resolution: Exploring Prokaryote Genomes Using Next-Generation Sequencing Technologies. Table S1 in its supplementary data is an Excel sheet giving read-length and gap benchmarks for 818 bacterial genomes. For the closely-related strain Nanoarchaeum equitans Kin4-M (genome length: 490885 bp), it predicts 2 gaps in the final assembly with 125 bp reads and 0 gaps with 250 bp reads.
With 50% of the reads, the longest contig (489315 bp) is 99.68017% of the Nanoarchaeum equitans Kin4-M genome length. So the verdict is: yes, subsampling does improve assembly quality for very high coverage datasets when using Velvet without removing noise from the reads.
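The coverage percentage quoted above is easy to reproduce with a quick awk calculation:

```shell
# Longest contig at 50% subsampling (489315 bp) as a percentage of the
# Nanoarchaeum equitans Kin4-M genome length (490885 bp).
awk 'BEGIN{printf "%.5f\n", 489315/490885*100}'
# prints 99.68017
```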
For each command given below, the first blastdbcmd invocation produces two fields per sequence (the GI and the title), the awk command selects those entries whose title contains "16S" and prints their GIs, and the second blastdbcmd invocation uses those GIs to print the corresponding 16S rRNA sequences from the nt database. Since the nt database is huge, the methods below use nohup to run the command in the background, so it keeps running even if you log off from a remote SSH session.
Method 1: Specify the organism names in the command itself:
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry all -outfmt "%g;;%t" | awk -F";;" '/16S/ && /Archaeoglobus fulgidus|Desulfovibrio piger|Herpetosiphon aurantiacus|Ignicoccus hospitalis|Methanocaldococcus jannaschii|Methanococcus maripaludis|Nanoarchaeum equitans|Pyrobaculum aerophilum|Pyrobaculum calidifontis|Pyrococcus horikoshii|Rhodopirellula baltica|Sulfolobus tokodaii/{print $1}' | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -out sequences.fasta &
Method 2: Specify the organism names in a file organisms.txt:
[uzi@quince-srv2 ~/oneliners]$ cat organisms.txt
Archaeoglobus fulgidus
Desulfovibrio piger
Herpetosiphon aurantiacus
Ignicoccus hospitalis
Methanocaldococcus jannaschii
Methanococcus maripaludis
Nanoarchaeum equitans
Pyrobaculum aerophilum
Pyrobaculum calidifontis
Pyrococcus horikoshii
Rhodopirellula baltica
Sulfolobus tokodaii
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry all -outfmt "%g;;%t" | grep -F "$(<organisms.txt)" | awk -F";;" '/16S/{print $1}' | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -out sequences2.fasta &
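The filtering stage of both methods can be sanity-checked in isolation by feeding the awk command fabricated "%g;;%t" lines (the GIs and titles below are made up for illustration):

```shell
# Only entries whose title mentions "16S" should survive; awk splits on the
# ";;" delimiter and prints the GI (field 1) of each matching entry.
printf '111;;Nanoarchaeum equitans 16S ribosomal RNA, complete sequence\n222;;Nanoarchaeum equitans hypothetical protein gene\n' | awk -F";;" '/16S/{print $1}'
# prints 111
```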
Now check the header information of the sequences we have pulled out:
[uzi@quince-srv2 ~/oneliners]$ grep -i ">" sequences.fasta
>gi|444303911|ref|NR_074334.1| Archaeoglobus fulgidus DSM 4304 strain DSM 4304 16S ribosomal RNA, complete sequence >gi|38792|emb|X05567.1| Archaeoglobus fulgidus strain VC-16 16S ribosomal RNA gene
>gi|444303911|ref|NR_074334.1| Archaeoglobus fulgidus DSM 4304 strain DSM 4304 16S ribosomal RNA, complete sequence >gi|38792|emb|X05567.1| Archaeoglobus fulgidus strain VC-16 16S ribosomal RNA gene
>gi|38805|emb|Y00275.1| Archaeoglobus fulgidus (VC-16) DNA for 16S ribosomal RNA
>gi|294440|gb|L07510.1|PYBRGDX Pyrobaculum aerophilum 16S ribosomal RNA gene
>gi|1145365|gb|U38484.1|MMU38484 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145366|gb|U38486.1|MMU38486 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145367|gb|U38487.1|MMU38487 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145368|gb|U38941.1|MMU38941 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|2232234|gb|AF005049.1|AF005049 Methanococcus maripaludis 16S ribosomal RNA gene, complete sequence
>gi|2564752|dbj|D45214.1|PYWOT3 Pyrococcus horikoshii 16S rRNA gene, complete sequence
>gi|343201091|ref|NR_041778.1| Desulfovibrio piger ATCC 29098 strain ATCC29098 16S ribosomal RNA, complete sequence >gi|6979515|gb|AF192152.1|AF192152 Desulfomonas pigra 16S ribosomal RNA gene, complete sequence
>gi|18151366|dbj|AB022438.1| Sulfolobus tokodaii genes for 16S rRNA, 23S rRNA, complete and partial sequence
>gi|20429035|emb|AJ318041.1| Nanoarchaeum equitans Kin4-M 16S rRNA gene, strain Kin4-M
>gi|265678650|ref|NR_028955.1| Ignicoccus hospitalis KIN4/I 16S ribosomal RNA, partial sequence >gi|20429036|emb|AJ318042.1| Ignicoccus sp. Kin4-I 16S rRNA gene, strain Kin4-I
>gi|343200235|ref|NR_040922.1| Pyrobaculum calidifontis JCM 11548 strain VA1 16S ribosomal RNA, complete sequence >gi|27807847|dbj|AB078332.1| Pyrobaculum calidifontis gene for 16S rRNA, complete sequence
>gi|343200235|ref|NR_040922.1| Pyrobaculum calidifontis JCM 11548 strain VA1 16S ribosomal RNA, complete sequence >gi|27807847|dbj|AB078332.1| Pyrobaculum calidifontis gene for 16S rRNA, complete sequence
>gi|2913866|dbj|D87344.1| Pyrococcus horikoshii gene for 16S ribosomal RNA, partial sequence
>gi|89077582|gb|DQ374392.1| Archaeoglobus fulgidus strain L3 16S ribosomal RNA gene, partial sequence
>gi|265678749|ref|NR_029054.1| Pyrococcus horikoshii strain JA-1 16S ribosomal RNA, partial sequence >gi|37543825|gb|AY099169.1| Pyrococcus horikoshii strain DSM 12428 16S ribosomal RNA gene, partial sequence
>gi|265678749|ref|NR_029054.1| Pyrococcus horikoshii strain JA-1 16S ribosomal RNA, partial sequence >gi|37543825|gb|AY099169.1| Pyrococcus horikoshii strain DSM 12428 16S ribosomal RNA gene, partial sequence
>gi|37543862|gb|AY099205.1| Pyrococcus horikoshii strain DSM 12428 16S-23S ribosomal RNA intergenic spacer region, complete sequence
>gi|116496446|gb|EF012748.1| Rhodopirellula baltica strain OJF4 16S ribosomal RNA gene, partial sequence
>gi|116496447|gb|EF012749.1| Rhodopirellula baltica strain OJF5 16S ribosomal RNA gene, partial sequence
>gi|116496448|gb|EF012750.1| Rhodopirellula baltica strain OJF6 16S ribosomal RNA gene, partial sequence
>gi|126211523|gb|EF421451.1| Rhodopirellula baltica strain OJF13 16S ribosomal RNA gene, partial sequence
>gi|126211524|gb|EF421452.1| Rhodopirellula baltica strain OJF14 16S ribosomal RNA gene, partial sequence
>gi|126211525|gb|EF421453.1| Rhodopirellula baltica strain OJF15 16S ribosomal RNA gene, partial sequence
>gi|148357800|gb|EF589343.1| Rhodopirellula baltica strain F1 16S ribosomal RNA gene, partial sequence
>gi|148357801|gb|EF589344.1| Rhodopirellula baltica strain Gtu1 16S ribosomal RNA gene, partial sequence
>gi|148357802|gb|EF589345.1| Rhodopirellula baltica strain G1 16S ribosomal RNA gene, partial sequence
>gi|148357804|gb|EF589347.1| Rhodopirellula baltica strain M5 16S ribosomal RNA gene, partial sequence
>gi|148357806|gb|EF589349.1| Rhodopirellula baltica strain Gr16 16S ribosomal RNA gene, partial sequence
>gi|148357811|gb|EF589354.1| Rhodopirellula baltica strain Cc1 16S ribosomal RNA gene, partial sequence
>gi|171879796|gb|EU545801.1| Sulfolobus tokodaii strain TW 16S ribosomal RNA gene, partial sequence
...
Download BioSQL from here. Once the software is installed, set up a database and import the BioSQL schema. The following command line creates a new database called bioseqdb on your own computer, belonging to the root user account:
[uzi@quince-srv2 ~/oneliners]$ mysqladmin -u root create bioseqdb
We can then tell MySQL to load the BioSQL schema we downloaded above. Change to the sql subdirectory of the unzipped BioSQL download, then:
[uzi@quince-srv2 ~/oneliners/biosql-1.0.1/sql]$ mysql -u root bioseqdb < biosqldb-mysql.sql
To update the NCBI taxonomy, change to the scripts subdirectory and use the following command:
[uzi@quince-srv2 ~/oneliners/biosql-1.0.1/scripts]$ ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql --dbuser root --download true
Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.
Go to the oneliners folder and check which tables were imported:
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb -e "show tables"
+----------------------------+
| Tables_in_bioseqdb |
+----------------------------+
| biodatabase |
| bioentry |
| bioentry_dbxref |
| bioentry_path |
| bioentry_qualifier_value |
| bioentry_reference |
| bioentry_relationship |
| biosequence |
| comment |
| dbxref |
| dbxref_qualifier_value |
| location |
| location_qualifier_value |
| ontology |
| reference |
| seqfeature |
| seqfeature_dbxref |
| seqfeature_path |
| seqfeature_qualifier_value |
| seqfeature_relationship |
| taxon |
| taxon_name |
| term |
| term_dbxref |
| term_path |
| term_relationship |
| term_relationship_term |
| term_synonym |
+----------------------------+
Check which databases are there:
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb
mysql> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| bioseqdb |
| mysql |
| test |
+--------------------+
mysql> use bioseqdb
Database changed
Check the attributes of the table "taxon":
mysql> select column_name from information_schema.columns where table_name='taxon';
+-------------------+
| column_name |
+-------------------+
| taxon_id |
| ncbi_taxon_id |
| parent_taxon_id |
| node_rank |
| genetic_code |
| mito_genetic_code |
| left_value |
| right_value |
+-------------------+
Check the attributes of the table "taxon_name":
mysql> select column_name from information_schema.columns where table_name='taxon_name';
+-------------+
| column_name |
+-------------+
| taxon_id |
| name |
| name_class |
+-------------+
Check if the tables are populated:
mysql> SELECT ncbi_taxon_id from taxon LIMIT 10;
+---------------+
| ncbi_taxon_id |
+---------------+
| 1 |
| 2 |
| 6 |
| 7 |
| 9 |
| 10 |
| 11 |
| 13 |
| 14 |
| 16 |
+---------------+
Now we will run the following commands one by one to trace the path from taxon id 54255 to the root node: find the parent taxon id of the given id, then use that id to find its parent, and repeat until the root node is reached:
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.ncbi_taxon_id=54255;
+--------------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+--------------------+-----------+-----------------+
| Thermofilum librum | species | 1793 |
+--------------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1793;
+-------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+-------------+-----------+-----------------+
| Thermofilum | genus | 86865 |
+-------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=86865;
+----------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+----------------+-----------+-----------------+
| Thermofilaceae | family | 1791 |
+----------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1791;
+-----------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+-----------------+-----------+-----------------+
| Thermoproteales | order | 149927 |
+-----------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=149927;
+--------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+--------------+-----------+-----------------+
| Thermoprotei | class | 12664 |
+--------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=12664;
+---------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+---------------+-----------+-----------------+
| Crenarchaeota | phylum | 1705 |
+---------------+-----------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1705;
+---------+--------------+-----------------+
| name | node_rank | parent_taxon_id |
+---------+--------------+-----------------+
| Archaea | superkingdom | 102425 |
+---------+--------------+-----------------+
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=102425;
+--------------------+-----------+-----------------+
| name | node_rank | parent_taxon_id |
+--------------------+-----------+-----------------+
| cellular organisms | no rank | 1 |
+--------------------+-----------+-----------------+
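The repeated queries above are a plain parent-pointer walk up the taxonomy tree. The same loop can be sketched outside MySQL with a toy two-column parent table (the node ids here are made up for illustration, not real taxon ids):

```shell
# Toy parent-pointer walk: each line of the file is "child parent".
# Start at node 5 and climb until the root (node 1) is reached.
printf '5 4\n4 3\n3 2\n2 1\n' > /tmp/parents.txt
node=5
path="$node"
while [ "$node" != "1" ]; do
  # Look up the parent of the current node in the table.
  node=$(awk -v n="$node" '$1==n{print $2}' /tmp/parents.txt)
  path="$path->$node"
done
echo "$path"
# prints 5->4->3->2->1
```

In BioSQL the loop body is the SELECT on taxon/taxon_name, and the termination test is parent_taxon_id reaching the root.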
I have written a small stored procedure to do the same thing as above. Make a new file named path_to_root_node.sql with the following contents:
DROP PROCEDURE IF EXISTS path_to_root_node;
DELIMITER //
CREATE PROCEDURE path_to_root_node(IN leaf INT )
BEGIN
DECLARE parent_taxon_id INT;
DECLARE taxon_name VARCHAR(255);
DECLARE taxon_rank VARCHAR(255);
DECLARE path_to_root VARCHAR(255);
SET @path_to_root = '';
SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id INTO @taxon_name, @taxon_rank, @parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id=taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.ncbi_taxon_id=leaf;
SET @path_to_root=CONCAT(@path_to_root,@taxon_name,':',@taxon_rank);
loop_label: LOOP
SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id INTO @taxon_name, @taxon_rank, @parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id=taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=@parent_taxon_id;
SET @path_to_root=CONCAT(@path_to_root,";",@taxon_name,':',@taxon_rank);
IF (@parent_taxon_id <> 1) THEN
ITERATE loop_label;
ELSE
LEAVE loop_label;
END IF;
END LOOP;
SELECT @path_to_root;
END//
DELIMITER ;
Now load the stored procedure and call it; it will print the complete path from the given taxon id to the root node:
mysql> source path_to_root_node.sql
mysql> call path_to_root_node(54255);
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| @path_to_root                                                                                                                                                  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Thermofilum librum:species;Thermofilum:genus;Thermofilaceae:family;Thermoproteales:order;Thermoprotei:class;Crenarchaeota:phylum;Archaea:superkingdom;cellular organisms:no rank |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can also call the stored procedure in one-liners from the Linux prompt:
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb -e "source path_to_root_node.sql;call path_to_root_node(54255);" | awk -F"|" '$0!~/^@path_to_root/{print $0}'
Thermofilum librum:species;Thermofilum:genus;Thermofilaceae:family;Thermoproteales:order;Thermoprotei:class;Crenarchaeota:phylum;Archaea:superkingdom;cellular organisms:no rank
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb -e "source path_to_root_node.sql;call path_to_root_node(9606);" | awk -F"|" '$0!~/^@path_to_root/{print $0}'
Homo sapiens:species;Homo:genus;Homininae:subfamily;Hominidae:family;Hominoidea:superfamily;Catarrhini:parvorder;Simiiformes:infraorder;Haplorrhini:suborder;Primates:order;Euarchontoglires:superorder;Eutheria:no rank;Theria:no rank;Mammalia:class;Amniota:no rank;Tetrapoda:no rank;Sarcopterygii:no rank;Euteleostomi:no rank;Teleostomi:no rank;Gnathostomata:superclass;Vertebrata:no rank;Craniata:subphylum;Chordata:phylum;Deuterostomia:no rank;Bilateria:no rank;Eumetazoa:no rank;Metazoa:kingdom;Opisthokonta:no rank;Eukaryota:superkingdom;cellular organisms:no rank
You can also do other fancy things with the stored-procedure. Say you have a list of accession numbers stored in a file accnos.list:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list
AB000389.1
AB000699.1
AB000700.1
AB000701.1
AB000702.1
AB001518.1
AB001724.1
AB001774.1
AB001775.1
AB001776.1
AB001777.1
AB001779.1
AB001781.1
AB001783.1
AB001784.1
AB001785.1
AB001791.1
AB001793.1
AB001797.1
AB001802.1
You are interested in finding the corresponding gID, taxa ID, and assignment at species and genus level for each accession number. You can get the gID and taxa ID from the locally installed nt database using blastdbcmd, and resolve the species- and genus-level assignments using path_to_root_node.sql. For example, the following script gets the assignment at species level:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -outfmt "%a,%g,%T" | while IFS=$',' read -r -a myArray; do echo ${myArray[0]},${myArray[1]},${myArray[2]},$(mysql --user=root bioseqdb -e "source path_to_root_node.sql; call path_to_root_node(${myArray[2]});" | awk -F "|" '$0!~/^@path_to_root/{print $0}' | perl -ne '@a=split(/:/,join(/,/,grep {/species/} split(/;/,$_)));print $a[0]'); done
AB000389.1,11036396,81037,Pseudoalteromonas elyakovii
AB000699.1,3107908,153948,Nitrosomonas sp. AL212
AB000700.1,3107909,153949,Nitrosomonas sp. JL21
AB000701.1,3107910,153947,Nitrosomonas sp. GH22
AB000702.1,3107911,42353,Nitrosomonas sp.
AB001518.1,1871429,47467,Ixodes scapularis endosymbiont
AB001724.1,1902830,267859,Microcystis elabens
AB001774.1,1902837,85991,Chlamydia pecorum
AB001775.1,1902838,85991,Chlamydia pecorum
AB001776.1,1902839,85991,Chlamydia pecorum
AB001777.1,1902840,85991,Chlamydia pecorum
AB001779.1,1902842,83554,Chlamydia psittaci
AB001778.1,1902841,331636,Chlamydia psittaci
AB001780.1,1902843,83554,Chlamydia psittaci
AB001781.1,1902844,83554,Chlamydia psittaci
AB001782.1,1902845,83554,Chlamydia psittaci
AB001786.1,1902849,83554,Chlamydia psittaci
AB001787.1,1902850,83554,Chlamydia psittaci
AB001788.1,1902851,83554,Chlamydia psittaci
AB001789.1,1902852,83554,Chlamydia psittaci
AB001790.1,1902853,83554,Chlamydia psittaci
AB001812.1,1902875,83554,Chlamydia psittaci
AB001783.1,1902846,83555,Chlamydophila abortus
AB001784.1,1902847,83554,Chlamydia psittaci
AB001785.1,1902848,83556,Chlamydophila felis
AB001791.1,1902854,83554,Chlamydia psittaci
AB001793.1,1902856,83554,Chlamydia psittaci
AB001794.1,1902857,83554,Chlamydia psittaci
AB001800.1,1902863,83554,Chlamydia psittaci
AB001801.1,1902864,83554,Chlamydia psittaci
AB001803.1,1902866,83554,Chlamydia psittaci
AB001796.1,1902859,83554,Chlamydia psittaci
AB001797.1,1902860,83554,Chlamydia psittaci
AB001799.1,1902862,83554,Chlamydia psittaci
AB001802.1,1902865,83554,Chlamydia psittaci
Using "genus" as the search pattern in the Perl grep block below gives the assignment at genus level:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -outfmt "%a,%g,%T" | while IFS=$',' read -r -a myArray; do echo ${myArray[0]},${myArray[1]},${myArray[2]},$(mysql --user=root bioseqdb -e "source path_to_root_node.sql; call path_to_root_node(${myArray[2]});" | awk -F "|" '$0!~/^@path_to_root/{print $0}' | perl -ne '@a=split(/:/,join(/,/,grep {/genus/} split(/;/,$_)));print $a[0]'); done
AB000389.1,11036396,81037,Pseudoalteromonas
AB000699.1,3107908,153948,Nitrosomonas
AB000700.1,3107909,153949,Nitrosomonas
AB000701.1,3107910,153947,Nitrosomonas
AB000702.1,3107911,42353,Nitrosomonas
AB001518.1,1871429,47467,
AB001724.1,1902830,267859,Microcystis
AB001774.1,1902837,85991,Chlamydia
AB001775.1,1902838,85991,Chlamydia
AB001776.1,1902839,85991,Chlamydia
AB001777.1,1902840,85991,Chlamydia
AB001779.1,1902842,83554,Chlamydia
AB001778.1,1902841,331636,Chlamydia
AB001780.1,1902843,83554,Chlamydia
AB001781.1,1902844,83554,Chlamydia
AB001782.1,1902845,83554,Chlamydia
AB001786.1,1902849,83554,Chlamydia
AB001787.1,1902850,83554,Chlamydia
AB001788.1,1902851,83554,Chlamydia
AB001789.1,1902852,83554,Chlamydia
AB001790.1,1902853,83554,Chlamydia
AB001812.1,1902875,83554,Chlamydia
AB001783.1,1902846,83555,Chlamydophila
AB001784.1,1902847,83554,Chlamydia
AB001785.1,1902848,83556,Chlamydophila
AB001791.1,1902854,83554,Chlamydia
AB001793.1,1902856,83554,Chlamydia
AB001794.1,1902857,83554,Chlamydia
AB001800.1,1902863,83554,Chlamydia
AB001801.1,1902864,83554,Chlamydia
AB001803.1,1902866,83554,Chlamydia
AB001796.1,1902859,83554,Chlamydia
AB001797.1,1902860,83554,Chlamydia
AB001799.1,1902862,83554,Chlamydia
AB001802.1,1902865,83554,Chlamydia
Last Updated by Dr Umer Zeeshan Ijaz on 13/04/2014.