Fancy One-liners

by Umer Zeeshan Ijaz and Chris Quince

This webpage lists some of the one-liners that we frequently use in metagenomic analyses. You can click on the following links to browse through the different topics, and you can copy/paste the commands into your terminal as they are, provided you follow the same naming conventions and folder structures as we have. Considerable effort has gone into devising these one-liners; we share them in the hope that, if they prove useful in your analyses, you will credit us appropriately.

Perl one-liners
Extracting information from GBK files
Identifying duplicates in two FASTA files (awk)
Converting "Sample[TAB]Feature[TAB]Abundance" list to tab-delimited abundance table
Dereplicating Reads
Paired-end assembler
Generating abundance tables and trees from CREST and RDP classifiers
Spatial comparison of read qualities between different sequencing runs (Illumina paired-end reads)
Subsampling FASTA and FASTQ files
Identifying low-coverage genomes in metagenomic shotgun sequencing samples (mock communities)
Variant Calling
Getting linkage information (reads that map to multiple contigs/genomes) from SAM files
Does subsampling improve assembly quality for very high coverage datasets when using Velvet without removing noise from reads?
Extracting 16S rRNA sequences from NCBI's locally installed nt database using blastdbcmd
Resolving NCBI Taxonomy using BioSQL

Perl one-liners

Printing all FASTA headers

Perl reads each line into the $_ variable thanks to the -n option, and then calls print whenever the line starts with >. Since print defaults to printing $_, you can leave $_ out of the print statement:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if /^>/' ABKL02.fsa 
>gi|219671117|gb|ABKL02000001.1| Clostridium difficile QCD-23m63 contig00001_2, whole genome shotgun sequence
>gi|219671116|gb|ABKL02000002.1| Clostridium difficile QCD-23m63 contig00002_2, whole genome shotgun sequence
>gi|219671115|gb|ABKL02000003.1| Clostridium difficile QCD-23m63 contig00003_2, whole genome shotgun sequence
>gi|219671114|gb|ABKL02000004.1| Clostridium difficile QCD-23m63 contig00004_2, whole genome shotgun sequence
>gi|219671113|gb|ABKL02000005.1| Clostridium difficile QCD-23m63 contig00005_2, whole genome shotgun sequence
>gi|219671112|gb|ABKL02000006.1| Clostridium difficile QCD-23m63 contig00006_2, whole genome shotgun sequence
>gi|219671111|gb|ABKL02000007.1| Clostridium difficile QCD-23m63 contig00007_2, whole genome shotgun sequence
>gi|219671110|gb|ABKL02000008.1| Clostridium difficile QCD-23m63 contig00008_2, whole genome shotgun sequence
>gi|219671109|gb|ABKL02000009.1| Clostridium difficile QCD-23m63 contig00009_2, whole genome shotgun sequence
...
...
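For comparison (my own aside, not part of the original set of one-liners), the same headers can be pulled out with grep; the Perl version is simply easier to extend when you want to do more than just print the matching lines:
grep '^>' ABKL02.fsa	# equivalent output to the Perl one-liner above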

Printing FASTA headers with line numbers

This one-liner uses the special variable $., which holds the current line number; each time Perl reads the next line, it increments $. by one.
[uzi@quince-srv2 ~/oneliners$ perl -ne 'print "$. $_" if /^>/ ' ABKL02.fsa 
1 >gi|219671117|gb|ABKL02000001.1| Clostridium difficile QCD-23m63 contig00001_2, whole genome shotgun sequence
133 >gi|219671116|gb|ABKL02000002.1| Clostridium difficile QCD-23m63 contig00002_2, whole genome shotgun sequence
1474 >gi|219671115|gb|ABKL02000003.1| Clostridium difficile QCD-23m63 contig00003_2, whole genome shotgun sequence
1486 >gi|219671114|gb|ABKL02000004.1| Clostridium difficile QCD-23m63 contig00004_2, whole genome shotgun sequence
1564 >gi|219671113|gb|ABKL02000005.1| Clostridium difficile QCD-23m63 contig00005_2, whole genome shotgun sequence
2566 >gi|219671112|gb|ABKL02000006.1| Clostridium difficile QCD-23m63 contig00006_2, whole genome shotgun sequence
2904 >gi|219671111|gb|ABKL02000007.1| Clostridium difficile QCD-23m63 contig00007_2, whole genome shotgun sequence
4311 >gi|219671110|gb|ABKL02000008.1| Clostridium difficile QCD-23m63 contig00008_2, whole genome shotgun sequence
...
...

Printing the first 5 lines of the abundance table (emulating head -5)

[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if $. <= 5' phylum.csv
Samples,S113_TAGAGCTGCCAT,S104_CCGAACGTCACT,S61_CTCATCATGTTC,S339_ACACCTGCGATC,S77_CACGATGGTCAT,S266_ACAGGAGGGTGT,S337_AACGAGGCAACG,S262_TTCTAGAGTGCG,S145_CAACGCTAGAAT,S11_TGGAAGAACGGC,S208_AGATGTCCGTCA,S108_GAACCTATGACA,S179_CTGCATACTGAG,S157_CATAAGGGAGGC,S163_GTGTGCTAACGT,S333_GCCTTACGATAG,S158_TGTGTTACTCCT,S329_CAATGTAGACAC,S23_CCTCTGAGAGCT,S97_GACGCTTTGCTG,S209_GCACCTGTTGAA,S68_CGAAGCATCTAC,S275_GCAAATCAGCCT,S13_AACCCAGATGAT,S147_TCTGTAGAGCCA,S178_TTGGTCTCCTCT,S183_AGCAGCTATTGC,S26_CGTGCACAATTG,S274_AGCACCGGTCTT,S597_TGTGGTGATGTA,S169_AAGAGTCTCTAG,S92_TTCCAGGCAGAT,S15_AACAAACTGCCA,S14_GATATACCAGTG,S10_CTTCCCTAACTC,S277_TCTTCAACTACC,S112_TGGCCGTTACTG,S199_CTCGGATAGATC,S307_TGACAACCGAAT,S221_TAGAGGCGTAGG,S144_CGAATGAGTCAT,S276_AGCGGCCTATTA,S595_CTATCCAAGTGG,S127_TCTAGCCTGGCA,S43_AGCGCTCACATC,S259_CGGTCTGTCTGA,S156_CGATATCAGTAG,S237_GATGATAACCCA,S213_TGAGTCATTGAG,S111_GTGAGTCATACC,S65_TCGGCGATCATC,S110_GAACAGCTCTAC,S263_ACGGATGTTATG,S181_GTCAATTAGTGG,S7_CAATCGGCTTGC,S153_TCGGATCTGTGA,S207_ATACGCATCAAG,S50_TAGCGCGAACTT,S594_TTCCTGTTAACC,S204_TACTGAGCCTCG,S4_CTAGGATCACTG,S598_CTTTCGTTCAAC,S171_AGATCTATGCAG,S66_GTTTCACGCGAA,S176_TACCTAGTGAGA,S206_AGCTTCGACAGT,S210_CCTAGAGAAACT,S60_TCGAGCCGATCT,S100_CAAACCTATGGC,S95_CTTGGAGGCTTA,S338_GAAGACAGCGAC,S609_CATCTGGGCAAT,S164_CTTGCGGCAATC,S74_AGGGTACAGGGT,S239_TTGTATGACAGG,S610_TGTCCGTGGATC,S118_GGCTAAACTATG,S12_GCTAGACACTAC,S315_TTGCCTGGGTCA,S53_ATATCGCGATGA,S51_CATACACGCACC,S225_TGCCGAGTAATC,S48_ACGCACATACAA,S331_TGGCGATACGTT,S310_TAGGCTCGTGCT,S52_ACCTCAGTCAAG,S600_GTTGGCGTTACA,S220_AAGCAGATTGTC,S608_TCTAACGAGTGC,S114_ATCTAGTGGCAA,S592_TCTGGAACGGTT,S267_GCTGTCGTCAAC,S593_GTACTACCTCGG,S316_CAATTCTGCTTC,S302_TGCAAGCTAAGT,S151_TGCGGTTGACTC,S90_GTCACCAATCCG,S96_ACGTGGTTCCAC,S59_CAAAGTTTGCGA,S261_GTACATGTCGCC,S305_GTGTGTGCCATA,S173_CGGCAAACACTT,S278_TGGAATTCGGCT,S1_ATTGCAAGCAAC,S217_TGCTCCGTAGAA,S165_TGAGGTTTGATG,S47_ATGGCCTGACTA,S49_TGAGTGGTCTGT,S152_TGAGAAGAAAGG,S335_TACCTGTGTCTT,S18_CAAGCCCTAGTA,S98_ACAGGGTTTGTA,S56_AGCAGGCACGAA,S107_CGACACGGAGAA,S73_TGTACCAACCGA,S99_GCCTATGAGATC,S230_GTAACCACCACC,S67_ACAAGAACCTTG,S202_TCCTTAGAAGGC,S9_TGACCGGCTGTT,S162_TAACCCGATAGA,S219_CGAGTTCATCGA,S205_CTCGATGTAAGC,S6_TGAGGACTACCT,S19_TAGTGTCGGATC,S180_CGATGAATATCG,S200_TTCCCGAAACGA,S601_GAAGTAGCGAGC,S236_ACTACCTCTTCA,S119_AAGAGCAGAGCC,S161_AGTGGCACTATC,S121_TCAACCCGTGAA,S211_GAGGTTCTTGAC,S279_TAAGATGCAGTC,S115_CCTTCAATGGGA,S603_GCGGAAACATGG,S106_CCATCACATAGG,S212_CTGTAAAGGTTG,S22_CGTAGAGCTCTC,S25_GCGGACTATTCA,S55_CCGATGCCTTGA,S16_GTAGACATGTGT,S155_CACAGGATTACC,S44_TGGTTATGGCAC,S270_ATAGCTTCGTGG,S314_CTCCTTAAGGCG,S599_CCGAAGATTCTG,S167_AAGAAGCCGGAC,S148_CCGACTCTAGGT,S172_GCACAAGGCAAG,S150_GACAACGAATCT,S606_ACGTAACCACGT,S271_CGGGATCAAATT,S63_GGCCTATAAGTC,S233_CATAGCTCGGTC,S273_ATCTTGGAGTCG,S20_GTCATAAGAACC,S605_TGCATGACAGTC,S235_TATGGAGCTAGT,S17_TACAGTTACGCG,S635_TTGCGACAAAGT,S591_TACCACAACGAA,S71_CTATGCCGGCTA,S228_ACCTTGACAAGA,S46_TGCTACAGACGT,S93_TATGGTACCCAG,S24_CCTCGATGCAGT,S122_GTTTGAAACACG,S91_CACTAACAAACG,S27_CGGCCTAAGTTC,S149_ATCCTACGAGCA,S340_GGCGTTGCATTC,S128_AATGCAATGCGT,S109_ATGCCGGTAATA,S177_CGTTCTGGTGGT,S94_CACGACTTGACA,S175_TTCCGAATCGGC,S69_GTTTGGCCACAC,S203_GATGGACTTCAA,S117_ACATACTGAGCA,S216_CATGCGGATCCT,S154_GCCGGTACTCTA,S57_TACGCAGCACTA,S8_AACACTCGATCG,S272_AGTCATCGAATG,S268_AAGCTTGAAACC,S168_ACGGGATACAGG,S102_ACCATCCAACGA,S62_CCAGGGACTTCT,S2_CACGTGACATGT,S612_GTTGGTTGGCAT,S72_GTGGTATGGGAG,S105_ACACCAACACCA,S101_ATCGCTTAAGGC,S223_TAGACCGACTCC,S214_TACGGCAGTTCA,S126_CACTTTGGGTGC,S264_TTGAGGCTACAA,S590_TGCGAGTATATG,S222_TCAGCGCCGTTA,S269_TAAGCGTCTCG
A,S613_TTCCACACGTGG,S5_GATGACCCAAAT,S45_TTGCACCGTCGA,S170_TCCGTCATGGGT,S257_CTACCACGGTAC,S201_GAACTTTAGCGC,S3_CACAGTTGAAGT,S64_TCCATTTCATGC,S607_GTCGGAAATTGT,S146_ATCAGAGCCCAT,S255_GGTAAGTTTGAC,S76_TTGGCGGGTTAT,S224_GTCAACGCTGTC,S103_GCAATAGGAGGA,S611_ACTCGGCCAACT,S182_AGTACGCAGTCT,S54_CGCCGGTAATCT,S75_AGAGTGCTAATC,S125_GCTCAGGACTCT,S166_ATTGCTGGTCGA,S159_GGTACCTGCAAT,S70_TGACGTAGAACT,S58_CGCTTAGTGCTG,S602_TTGCGGACCCTA,S317_ACTGGCAAACCT,S596_GTGTCCGGATTC,S124_TCGCCAGTGCAT,S21_GTCCGCAAGTTA,S238_GGCCCAATATAA,S123_AGAGAGACAGGT,S174_GCGAGTTCCTGT,S218_TGATAGGTACAC,S604_AACGTTAGTGTG,S120_GGAGAGATCACG,S318_AATCAGAGCTTG,S215_TGCACAGTCGCT,S160_TCGCCTATAAGG,S265_GTAGGAACCGGA
"Acidobacteria",2,368,6,12,62,476,227,85,58,3,918,131,34,23,17,250,67,15,2,46,69,51,0,23,18,163,120,164,224,2,104,81,143,257,7,61,279,26,139,13,319,47,3,104,13,39,113,39,72,63,132,48,4,176,8,7,31,60,7,396,38,4,86,61,230,24,176,1,13,99,62,15,25,5,121,19,62,4,51,112,275,56,19,16,1,141,1,8,5,268,32,291,27,65,23,13,282,135,1,99,157,545,93,70,144,44,19,31,8,50,10,59,39,126,6,27,10,74,42,108,4,21,14,28,46,52,9,3,45,55,2,164,122,142,332,171,289,134,4,127,76,79,55,13,656,59,4,31,28,194,10,41,413,293,101,207,9,82,10,29,3,75,2,37,47,122,179,119,229,6,35,12,67,41,115,129,231,21,171,339,7,85,2,1,157,218,75,713,12,125,2,3,322,8,59,75,0,0,282,27,133,3,25,42,32,72,9,29,241,11,28,68,210,75,536,1,62,70,108,4,73,88,8,18,272,91,0,102,2,84,78,70,13,4,35,135,57,186,586
"Actinobacteria",6,87,15,30,162,200,141,165,101,0,367,47,59,9,3,127,14,13,10,82,39,50,1,47,58,86,211,236,50,186,33,264,136,235,6,27,126,26,83,23,527,23,565,28,38,72,39,73,25,31,118,48,10,183,16,6,28,90,763,304,59,9,110,99,113,8,78,5,7,111,49,12,16,6,300,16,74,8,32,103,284,71,37,5,0,187,4,2,22,382,105,80,76,58,44,11,1633,133,10,137,193,513,22,118,40,25,56,25,3,24,11,82,35,45,16,36,14,86,33,167,1,5,17,18,46,43,11,2,112,78,5,83,53,57,311,1620,128,35,8,195,52,95,78,17,201,40,5,9,66,210,19,60,124,240,102,32,10,538,44,49,2,216,6,30,102,259,286,76,562,17,51,5,56,30,64,167,176,29,125,284,8,82,8,4,41,62,20,344,17,138,5,9,131,15,14,30,0,1,656,14,34,6,30,114,60,69,28,32,194,12,73,72,193,23,236,7,92,100,141,4,31,47,11,64,2901,65,238,80,3,116,71,55,5,104,43,100,33,68,1140
"Bacteroidetes",14,2526,40,285,1251,4724,7950,1576,1582,7,13771,679,491,109,17,7705,250,119,14,749,660,960,0,205,331,1772,3111,5167,931,202,264,4070,2885,6917,11,279,2077,183,5253,26,14552,135,769,1081,116,641,703,287,229,120,2797,92,36,4983,29,26,145,735,1267,8271,310,33,1419,1692,3170,91,1796,14,110,1479,1080,509,85,30,4419,694,951,13,1590,1208,4233,2214,277,222,5,2248,14,6,802,6901,900,2136,661,3280,1661,51,24672,1979,24,1750,6593,11393,357,1522,767,158,491,265,13,460,117,614,457,545,28,204,351,1056,669,2026,9,10,110,47,545,541,86,16,458,977,20,2478,451,543,5941,15296,1588,492,15,3820,704,1089,1321,93,4938,2036,47,46,406,3677,100,3945,3717,9054,1755,835,48,2417,135,551,34,3149,22,976,1194,4405,6098,1496,9924,33,546,120,724,177,1316,2165,4187,92,2133,5088,31,2312,38,13,1468,1228,228,8310,44,1400,189,35,1868,108,106,266,0,26,10963,34,534,135,248,1763,593,791,265,218,5421,1254,744,777,4338,124,5035,261,1175,862,2770,22,159,344,60,422,21848,4352,259,876,24,1195,805,1503,11,91,776,7774,169,1034,17498
"Chloroflexi",297,4778,333,518,1877,4905,7863,2618,1398,82,12343,1654,1005,500,264,2659,1243,261,101,1471,1010,1008,1,830,447,2919,2220,2752,1642,29,1089,5158,2502,5998,153,1096,6151,497,3161,335,5255,619,163,2009,241,700,1081,554,837,794,2219,606,110,3580,277,68,749,806,189,8837,1334,146,2443,1108,3538,316,3163,200,414,2655,1250,296,263,221,2572,435,1227,114,2271,3484,5813,299,247,383,11,4496,67,112,377,2731,1640,2711,989,2086,635,189,15615,3246,265,1085,5727,17216,1196,2184,1992,463,879,1028,42,812,561,1491,1210,2011,196,565,233,1728,888,2284,96,202,553,303,659,1226,230,59,698,1069,130,3012,1708,619,6314,8725,5504,1149,97,2026,1416,1855,1169,376,5862,1924,136,459,1015,6826,372,805,4001,7577,537,2921,136,2874,526,919,64,2450,168,213,982,3827,2568,2750,9880,133,864,253,1845,698,4384,3452,4300,253,3599,6613,208,971,415,135,3373,3696,1420,15173,371,2767,65,128,4741,367,529,723,0,35,7571,398,1605,126,437,1233,1441,1493,552,725,4915,237,840,2621,3659,910,10517,64,1370,2067,2687,88,864,859,382,1288,15388,4590,56,1757,55,1028,2902,1931,148,125,1606,4289,671,1821,8244

Printing the last line of the abundance table (emulating tail -1)

[uzi@quince-srv2 ~/oneliners$ perl -ne 'print if eof' phylum.csv
Cyanobacteria/Chloroplast,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0

Printing the last 5 lines of the abundance table (emulating tail -5)

Here, we push each line onto the @a array and then replace the array with a slice of itself. @a = @a[@a-5..$#a] means: replace @a with its last 5 elements. @a-5 is evaluated in scalar context here, so it returns the number of elements in the array minus 5, and $#a is the last index of the @a array.
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-5..$#a]; END { print @a }' phylum.csv
"Lentisphaerae",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Elusimicrobia",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,3,0,0,0,0,0,4,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Fusobacteria",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"Deinococcus-Thermus",0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Cyanobacteria/Chloroplast,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
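Note (my own sketch, not from the original page): the head -5 emulation above can be made to stop reading as soon as the fifth line has been printed, which matters for very large tables, whereas the tail emulations necessarily scan the whole file:
perl -ne 'print; exit if $. == 5' phylum.csv	# same output as the head -5 emulation, but exits after line 5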

Converting comma-delimited format to tab-delimited format

-a enables automatic splitting of the current line into fields in the @F array. The splitting happens on whitespace by default; -F"," changes the delimiter to a comma. The -l switch ensures that print outputs a newline at the end of each line.
[uzi@quince-srv2 ~/oneliners$ perl -F"," -alne 'print join("\t",@F)' phylum.csv | head -3
Samples	S113_TAGAGCTGCCAT	S104_CCGAACGTCACT	S61_CTCATCATGTTC	S339_ACACCTGCGATC	S77_CACGATGGTCAT	S266_ACAGGAGGGTGT	S337_AACGAGGCAACG	S262_TTCTAGAGTGCG	S145_CAACGCTAGAAT	S11_TGGAAGAACGGC	S208_AGATGTCCGTCA	S108_GAACCTATGACA	S179_CTGCATACTGAG	S157_CATAAGGGAGGC	S163_GTGTGCTAACGT	S333_GCCTTACGATAG	S158_TGTGTTACTCCT	S329_CAATGTAGACAC	S23_CCTCTGAGAGCT	S97_GACGCTTTGCTG	S209_GCACCTGTTGAA	S68_CGAAGCATCTAC	S275_GCAAATCAGCCT	S13_AACCCAGATGAT	S147_TCTGTAGAGCCA	S178_TTGGTCTCCTCT	S183_AGCAGCTATTGC	S26_CGTGCACAATTG	S274_AGCACCGGTCTT	S597_TGTGGTGATGTA	S169_AAGAGTCTCTAG	S92_TTCCAGGCAGAT	S15_AACAAACTGCCA	S14_GATATACCAGTG	S10_CTTCCCTAACTC	S277_TCTTCAACTACC	S112_TGGCCGTTACTG	S199_CTCGGATAGATC	S307_TGACAACCGAAT	S221_TAGAGGCGTAGG	S144_CGAATGAGTCAT	S276_AGCGGCCTATTA	S595_CTATCCAAGTGG	S127_TCTAGCCTGGCA	S43_AGCGCTCACATC	S259_CGGTCTGTCTGA	S156_CGATATCAGTAG	S237_GATGATAACCCA	S213_TGAGTCATTGAG	S111_GTGAGTCATACC	S65_TCGGCGATCATC	S110_GAACAGCTCTAC	S263_ACGGATGTTATG	S181_GTCAATTAGTGG	S7_CAATCGGCTTGC	S153_TCGGATCTGTGA	S207_ATACGCATCAAG	S50_TAGCGCGAACTT	S594_TTCCTGTTAACC	S204_TACTGAGCCTCG	S4_CTAGGATCACTG	S598_CTTTCGTTCAAC	S171_AGATCTATGCAG	S66_GTTTCACGCGAA	S176_TACCTAGTGAGA	S206_AGCTTCGACAGT	S210_CCTAGAGAAACT	S60_TCGAGCCGATCT	S100_CAAACCTATGGC	S95_CTTGGAGGCTTA	S338_GAAGACAGCGAC	S609_CATCTGGGCAAT	S164_CTTGCGGCAATC	S74_AGGGTACAGGGT	S239_TTGTATGACAGG	S610_TGTCCGTGGATC	S118_GGCTAAACTATG	S12_GCTAGACACTAC	S315_TTGCCTGGGTCA	S53_ATATCGCGATGA	S51_CATACACGCACC	S225_TGCCGAGTAATC	S48_ACGCACATACAA	S331_TGGCGATACGTT	S310_TAGGCTCGTGCT	S52_ACCTCAGTCAAG	S600_GTTGGCGTTACA	S220_AAGCAGATTGTC	S608_TCTAACGAGTGC	S114_ATCTAGTGGCAA	S592_TCTGGAACGGTT	S267_GCTGTCGTCAAC	S593_GTACTACCTCGG	S316_CAATTCTGCTTC	S302_TGCAAGCTAAGT	S151_TGCGGTTGACTC	S90_GTCACCAATCCG	S96_ACGTGGTTCCAC	S59_CAAAGTTTGCGA	S261_GTACATGTCGCC	S305_GTGTGTGCCATA	S173_CGGCAAACACTT	S278_TGGAATTCGGCT	S1_ATTGCAAGCAAC	S217_TGCTCCGTAGAA	S165_TGAGGTTTGATG	S47_ATGGCCTGACTA	S49_TGAGTGGTCTGT	S152_TGAGAAGAAAGG	S335_TACCTGTGTCTT	S18_CAAGCCCTAGTA	S98_ACAGGGTTTGTA	S56_AGCAGGCACGAA	S107_CGACACGGAGAA	S73_TGTACCAACCGA	S99_GCCTATGAGATC	S230_GTAACCACCACC	S67_ACAAGAACCTTG	S202_TCCTTAGAAGGC	S9_TGACCGGCTGTT	S162_TAACCCGATAGA	S219_CGAGTTCATCGA	S205_CTCGATGTAAGC	S6_TGAGGACTACCT	S19_TAGTGTCGGATC	S180_CGATGAATATCG	S200_TTCCCGAAACGA	S601_GAAGTAGCGAGC	S236_ACTACCTCTTCA	S119_AAGAGCAGAGCC	S161_AGTGGCACTATC	S121_TCAACCCGTGAA	S211_GAGGTTCTTGAC	S279_TAAGATGCAGTC	S115_CCTTCAATGGGA	S603_GCGGAAACATGG	S106_CCATCACATAGG	S212_CTGTAAAGGTTG	S22_CGTAGAGCTCTC	S25_GCGGACTATTCA	S55_CCGATGCCTTGA	S16_GTAGACATGTGT	S155_CACAGGATTACC	S44_TGGTTATGGCAC	S270_ATAGCTTCGTGG	S314_CTCCTTAAGGCG	S599_CCGAAGATTCTG	S167_AAGAAGCCGGAC	S148_CCGACTCTAGGT	S172_GCACAAGGCAAG	S150_GACAACGAATCT	S606_ACGTAACCACGT	S271_CGGGATCAAATT	S63_GGCCTATAAGTC	S233_CATAGCTCGGTC	S273_ATCTTGGAGTCG	S20_GTCATAAGAACC	S605_TGCATGACAGTC	S235_TATGGAGCTAGT	S17_TACAGTTACGCG	S635_TTGCGACAAAGT	S591_TACCACAACGAA	S71_CTATGCCGGCTA	S228_ACCTTGACAAGA	S46_TGCTACAGACGT	S93_TATGGTACCCAG	S24_CCTCGATGCAGT	S122_GTTTGAAACACG	S91_CACTAACAAACG	S27_CGGCCTAAGTTC	S149_ATCCTACGAGCA	S340_GGCGTTGCATTC	S128_AATGCAATGCGT	S109_ATGCCGGTAATA	S177_CGTTCTGGTGGT	S94_CACGACTTGACA	S175_TTCCGAATCGGC	S69_GTTTGGCCACAC	S203_GATGGACTTCAA	S117_ACATACTGAGCA	S216_CATGCGGATCCT	S154_GCCGGTACTCTA	S57_TACGCAGCACTA	S8_AACACTCGATCG	S272_AGTCATCGAATG	S268_AAGCTTGAAACC	S168_ACGGGATACAGG	S102_ACCATCCAACGA	S62_CCAGGGACTTCT	S2_CACGTGACATGT	S612_GTTGGTTGGCAT	S72_GTGGTATGGGAG	S105_ACACCAACACCA	S101_ATCGCTTAAGGC	S223_TAGACCGACTCC	S214_TACGGCAGTTCA	S126_CACTTTGGGTGC	S264_TTGAGGCTACAA	S590_TGCGAGTATATG	S222_TCAGCGCCGTTA	
S269_TAAGCGTCTCGA	S613_TTCCACACGTGG	S5_GATGACCCAAAT	S45_TTGCACCGTCGA	S170_TCCGTCATGGGT	S257_CTACCACGGTAC	S201_GAACTTTAGCGC	S3_CACAGTTGAAGT	S64_TCCATTTCATGC	S607_GTCGGAAATTGT	S146_ATCAGAGCCCAT	S255_GGTAAGTTTGAC	S76_TTGGCGGGTTAT	S224_GTCAACGCTGTC	S103_GCAATAGGAGGA	S611_ACTCGGCCAACT	S182_AGTACGCAGTCT	S54_CGCCGGTAATCT	S75_AGAGTGCTAATC	S125_GCTCAGGACTCT	S166_ATTGCTGGTCGA	S159_GGTACCTGCAAT	S70_TGACGTAGAACT	S58_CGCTTAGTGCTG	S602_TTGCGGACCCTA	S317_ACTGGCAAACCT	S596_GTGTCCGGATTC	S124_TCGCCAGTGCAT	S21_GTCCGCAAGTTA	S238_GGCCCAATATAA	S123_AGAGAGACAGGT	S174_GCGAGTTCCTGT	S218_TGATAGGTACAC	S604_AACGTTAGTGTG	S120_GGAGAGATCACG	S318_AATCAGAGCTTG	S215_TGCACAGTCGCT	S160_TCGCCTATAAGG	S265_GTAGGAACCGGA
"Acidobacteria"	2	368	6	12	62	476	227	85	58	3	918	131	34	23	17	250	67	15	2	46	69	51	0	23	18	16120	164	224	2	104	81	143	257	7	61	279	26	139	13	319	47	3	104	13	39	113	39	72	63	132	48	4	1731	60	7	396	38	4	86	61	230	24	176	1	13	99	62	15	25	5	121	19	62	4	51	112	275	56	19	16141	1	8	5	268	32	291	27	65	23	13	282	135	1	99	157	545	93	70	144	44	19	31	8	50	10	59	39126	6	27	10	74	42	108	4	21	14	28	46	52	9	3	45	55	2	164	122	142	332	171	289	134	4	127	7679	55	13	656	59	4	31	28	194	10	41	413	293	101	207	9	82	10	29	3	75	2	37	47	122	179	119	2235	12	67	41	115	129	231	21	171	339	7	85	2	1	157	218	75	713	12	125	2	3	322	8	59	75	0	0	282	27	133	3	25	42	32	72	9	29	241	11	28	68	210	75	536	1	62	70	108	4	73	88	8	18	272	91102	2	84	78	70	13	4	35	135	57	186	586
"Actinobacteria"	6	87	15	30	162	200	141	165	101	0	367	47	59	9	3	127	14	13	10	82	39	50	1	47	5886	211	236	50	186	33	264	136	235	6	27	126	26	83	23	527	23	565	28	38	72	39	73	25	31	118	48	10183	16	6	28	90	763	304	59	9	110	99	113	8	78	5	7	111	49	12	16	6	300	16	74	8	32	103	2871	37	5	0	187	4	2	22	382	105	80	76	58	44	11	1633	133	10	137	193	513	22	118	40	25	56	25	3	24	11	82	35	45	16	36	14	86	33	167	1	5	17	18	46	43	11	2	112	78	5	83	53	57	311	1620	1235	8	195	52	95	78	17	201	40	5	9	66	210	19	60	124	240	102	32	10	538	44	49	2	216	6	30	10259	286	76	562	17	51	5	56	30	64	167	176	29	125	284	8	82	8	4	41	62	20	344	17	138	5	9	1315	14	30	0	1	656	14	34	6	30	114	60	69	28	32	194	12	73	72	193	23	236	7	92	100	141	4	3147	11	64	2901	65	238	80	3	116	71	55	5	104	43	100	33	68	1140

Permutations and Combinations

In the first one-liner, the recursive subroutine p() builds the Cartesian product of the arrays it is given: p(@a,@a) yields di-mers, and calling p(@a,@a,@a,@a) generates tetra-mers (a sketch of this is given after the bash example below). You can also give arrays of different lengths to p(). The second one-liner lists all combinations (subsets) of the bases, and the last one-liner is the bash equivalent, included for comparative purposes.
[uzi@quince-srv2 ~/oneliners$ perl -le 'sub p{my $l=pop @_;unless(@_){return map [$_],@$l;}return map { my $ll=$_; map [@$ll,$_],@$l} p(@_);} @a=[A,C,G,T]; print join("", @$_) for p(@a,@a);'
AA
AC
AG
AT
CA
CC
CG
CT
GA
GC
GG
GT
TA
TC
TG
TT
[uzi@quince-srv2 ~/oneliners$ perl -le 'sub c{return [] unless @_;my $f=shift;my @r=c(@_); return @r,map{[$f,@$_]}@r;} print join("",@$_) for c(A,C,G,T)'

T
G
GT
C
CT
CG
CGT
A
AT
AG
AGT
AC
ACT
ACG
ACGT
[uzi@quince-srv2 ~/oneliners$ echo {A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}
AAAA AAAC AAAG AAAT AACA AACC AACG AACT AAGA AAGC AAGG AAGT AATA AATC AATG AATT ACAA ACAC ACAG ACAT ACCA ACCC ACCG ACCT ACGA ACGC ACGG ACGT ACTA ACTC ACTG ACTT AGAA AGAC AGAG AGAT AGCA AGCC AGCG AGCT AGGA AGGC AGGG AGGT AGTA AGTC AGTG AGTT ATAA ATAC ATAG ATAT ATCA ATCC ATCG ATCT ATGA ATGC ATGG ATGT ATTA ATTC ATTG ATTT CAAA CAAC CAAG CAAT CACA CACC CACG CACT CAGA CAGC CAGG CAGT CATA CATC CATG CATT CCAA CCAC CCAG CCAT CCCA CCCC CCCG CCCT CCGA CCGC CCGG CCGT CCTA CCTC CCTG CCTT CGAA CGAC CGAG CGAT CGCA CGCC CGCG CGCT CGGA CGGC CGGG CGGT CGTA CGTC CGTG CGTT CTAA CTAC CTAG CTAT CTCA CTCC CTCG CTCT CTGA CTGC CTGG CTGT CTTA CTTC CTTG CTTT GAAA GAAC GAAG GAAT GACA GACC GACG GACT GAGA GAGC GAGG GAGT GATA GATC GATG GATT GCAA GCAC GCAG GCAT GCCA GCCC GCCG GCCT GCGA GCGC GCGG GCGT GCTA GCTC GCTG GCTT GGAA GGAC GGAG GGAT GGCA GGCC GGCG GGCT GGGA GGGC GGGG GGGT GGTA GGTC GGTG GGTT GTAA GTAC GTAG GTAT GTCA GTCC GTCG GTCT GTGA GTGC GTGG GTGT GTTA GTTC GTTG GTTT TAAA TAAC TAAG TAAT TACA TACC TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT TCGA TCGC TCGG TCGT TCTA TCTC TCTG TCTT TGAA TGAC TGAG TGAT TGCA TGCC TGCG TGCT TGGA TGGC TGGG TGGT TGTA TGTC TGTG TGTT TTAA TTAC TTAG TTAT TTCA TTCC TTCG TTCT TTGA TTGC TTGG TTGT TTTA TTTC TTTG TTTT
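As mentioned above, passing four copies of @a to p() gives the Perl equivalent of the bash tetra-mer expansion. A quick sketch (mine, not on the original page), counting the output instead of listing all 256 lines:
perl -le 'sub p{my $l=pop @_;unless(@_){return map [$_],@$l;}return map { my $ll=$_; map [@$ll,$_],@$l} p(@_);} @a=[A,C,G,T]; print join("", @$_) for p(@a,@a,@a,@a);' | wc -l
# prints 256, i.e. one line per tetra-mer from AAAA to TTTT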

Reverse complementing sequences

[uzi@quince-srv2 ~/oneliners$ echo -e "ACGTACGTACGT\nAGCTACT" | perl -nle 'print map{$_ =~ tr/ACGT/TGCA/; $_} reverse split("",$_)'
ACGTACGTACGT
AGTAGCT
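A slightly shorter variant (my sketch, not from the original page) reverses the whole string in scalar context first and then complements it with a single tr///:
echo -e "ACGTACGTACGT\nAGCTACT" | perl -nle '$_ = reverse $_; tr/ACGT/TGCA/; print'
# same output as above: reverse first, then complement every base in one pass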

Manipulating FASTQ files

We can print a FASTQ file four lines (one record) at a time:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; print @a if ($. % 4 == 0)' forward.fastq | head
@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
...
...
In fact, we are printing all of the lines:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; print @a if ($. % 4 == 0)' forward.fastq | wc -l
40000
[uzi@quince-srv2 ~/oneliners$ wc -l < forward.fastq
40000
So why is this functionality useful? Because we can convert a FASTQ file to a FASTA file quite easily by printing the first two elements of @a:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){print ">".substr($a[0],1).$a[1]}' forward.fastq | head
>MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
>MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
>MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
>MSQ-M01442:38:000000000-A5H4M:1:2104:20735:20554 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTCAGGGAAGAAATCCTTTGAGTTAATACCTCGGAGGGATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
...
...
We can also draw the PHRED quality scores of each read in a graphical manner, one bar of underscores per base:
[uzi@quince-srv2 ~/oneliners$ perl -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){chomp($a[3]);print substr($a[0],1).join("\n", map{ "_"x(ord($_)-33).(ord($_)-33) }split("",$a[3]))."\n"}' forward.fastq | head -251
MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
___________________________________35
___________________________________35
___________________________________35
___________________________________35
__________________________________34
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
__________________________________34
__________________________________34
___________________________________35
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
______________________________________38
_______________________________________39
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_______________15
__________________________26
_________________________25
__________________________26
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
______________14
_________________________25
__________________________26
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_________________________25
_____________13
__________________________26
__________________________26
_________________________________33
_________________________________33
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
_____________________________________37
...
...
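Building on the same ord($_)-33 decoding (a sketch of mine, not from the original page), we can also report the mean PHRED score per read:
perl -ne 'push @a,$_; @a=@a[@a-4..$#a]; if($. % 4 == 0){chomp($a[3]); @q=map{ord($_)-33} split("",$a[3]); $s=0; $s+=$_ for @q; printf("%s\t%.1f\n",(split(" ",substr($a[0],1)))[0],$s/@q)}' forward.fastq | head
# one line per read: the read identifier followed by its average base quality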

Extracting information from GBK files

Say we are given a set of GBK files in a folder. They can be genomes downloaded from NCBI, or annotated contigs from software such as PROKKA. We can generate a simple tabulated structure (the first column being the LOCUS and all other columns containing the features we are interested in) using the one-liners given below:

Let us start with some random GBK files I downloaded from NCBI
[uzi@quince-srv2 ~/oneliners/GBK_files$ ls

NC_013164.gbk  NC_014259.gbk  NC_017100.gbk  NC_021150.gbk  NC_021285.gbk
Note: In the one-liners given below, the trailing | perl -alne '{print join("\t",@F[0..5])}' is not required; it is only there to show the structure of the output (the first six columns) without cluttering this demonstration. To remove it:
-->to store the full table in a file, replace it with > table.tsv
-->to store a table with unique keywords only, replace it with | perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}' > table.tsv
Also note that if a genome/contig doesn't contain the feature in question, awk 'NF>1{print $0}' will exclude it from the output
You can check here to see how I have used the extracted EC numbers with a REST-style KEGG API
Extracting /db_xref="taxon:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/taxon/{print gensub(" +/db_xref=\"taxon:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_013164	525919				
NC_014259	436717				
NC_017100	634453				
NC_021150	1283331				
NC_021285	1167634
Extracting /EC_number="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_013164	2.1.1.37	2.7.7.49			
NC_014259	6.1.1.1		4.2.1.9		4.1.1.31	1.3.1.26	1.1.1.274
NC_021285	3.1.4.-		3.6.3.25	2.4.1.-		2.7.7.7		2.3.1.1
Extracting /db_xref="GeneID:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/GeneID/{print gensub(" +/db_xref=\"GeneID:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_013164	8368769		8368769		8368657		8368657		8368658
NC_014259	9384321		9384321		9380389		9380389		9380390
NC_017100	12066596	12066596	12066598	12066598	12066600
NC_021150	15372752	15372752	15367888	15367888	15370905
NC_021285	16467804	16467804	16467805	16467805	16467806
Extracting /db_xref="GI:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/db_xref=\"GI/{print gensub(" +/db_xref=\"GI:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_013164	256821124	256821125	256821126	256821127	256821128
NC_014259	299768251	299768252	299768253	299768254	299768255
NC_017100	384049542	384049543	384049544	384049545	384049546
NC_021150	482897842	482899070	482898704	482898122	482899071
NC_021285	528981797	528981798	528981799	528981800	528981801
Extracting /db_xref="CDD:*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/CDD/{print gensub(" +/db_xref=\"CDD:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_013164	200366	100105	100105	234461	200370
NC_014259	221153	232941	99707	99707	99707
NC_017100	218785	238416	223655	238416	238416
NC_021150	221153	232941	99707	99707	99707
NC_021285	183285	239900	239900	239900	170049
Extracting /gene="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'

NC_014259	dnaA	dnaA	dnaA	dnaA		dnaA
NC_017100	rusA	rusA	rusA	bcr/cflA	bcr/cflA
NC_021150	dnaA	dnaA	dnaA	dnaA		dnaA
Extracting unique /gene="*"
[uzi@quince-srv2 ~/oneliners/GBK_files$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}'

NC_014259	dnaA	anmK	dkgB	nusB	ribH	psd	pyrE	murG	murC	ddl	lpxC	aceE	glmM	actP	hisH	hisB	ompR	gltX	mraW	mraY	aroK	aroB	gltB	gltD	rpoZ	gmk	ispH	rpsP	rimM	trmD	rplS	truB	glyS	glyQ	aroE	ribA	engB	rpsJ	rplC	rplD	rplW	rplB	rpsS	rplV	rpsC	rplP	rpsQ	rplN	rplX	rplE	rpsN	rpsH	rplF	rplR	rpsE	rpmD	rplO	secY	rpmJ	rpsM	rpsD	rplQ	rplM	ksgA	apaH	secF	secD	tgt	queA	aspS	trpA	mscL	gatB	gatA	thrH	valS	rplU	rpmA	sdhA	sdhB	sucA	sucC	greA	glpD	groES	groEL	dapF	tolB	ruvB	ruvA	dapD	tmk	rnc	era	obgE	mnmA	rpmE	hemE	ampC	trpC	trpD	ubiA	cca	fadE	rpsB	tsf	queF	glyA	ppnK	thiG	gpsA	rpsU	purH	prmA	rplI	rpsR	rpsF	prfA	miaA	pepN	fabG	murB	moaC	fumC	ispF	pyrH	frr	lpxD	fabZ	recA	ompL	argC	pyrG	eno	ispD	aroD	tynA	antC	aspA	recR	hscA	rpsT	rpsA	cmk	gidB	pgk	ruvC	purT	tauD	paaB	paaA	cysS	xseA	benD	trmB	pyrB	rlmL	alaS	rnhB	adk	ureE	ureC	ureA	lysS	metH	dut	aat	rpsL	coaD	smpB	hemA	ipk	rpmF	bioD	dcd	metG	upp	argJ	hisD	hisG	rho	ihfA	pheT	pheS	rplT	rpmI	infC	thrS	panC	panB	mazG	rumA	cysM	prfC	ilvH	leuS	engA	hisS	ispG	ndk	proA	clpX	clpP	tig	metX	thyA	rpmB	rpmG	gltP	truA	infA	leuD	rpmE2	rpsO	infB	nusA	secG	tpiA	coaE	uvrC	fadB	fadA	rpoB	rplL	rplJ	rplA	rplK	hemC	nrdR	def	atpC	argS	fur	guaA	prpB	lldD	pgi	rph	ileS	lspA	dnaK	trmE	rpmH
NC_017100	rusA	bcr/cflA	uvrA	nifU	recA	modE/mopA	mdoG	mdoH	recR	hrcA	exbB/tolQ	exbD/tolR	rodA	ftsI	mreD	mreC	mreB	fusB	marR	degT	suv3	nusG	secE	pfkB	cysA	cysW	cysT	araC	acrB	nodT	luxR	hlyD/fusE	lysR	ligB	ttg2D	djlA	queA	yjgF	ttg2C	uvrB	impB	clpS	clpA	chvG	chvI	grpE	dnaK	dnaJ	duf6	fur	rho	corA	phoB	phoR	oprO	corC/hlyC	eamA	mazG	htpG	xre	iojap	aldC	oprB	ttg1D	nifB/moaA	dedA	matE	nhaAP	fecCD	sss	mauE	lysE	pnuC	badM/rrf2	crcB	tetR	hspA	cobW	uspA	bipA	glnBK	lepA	metW	rhlE	hppK	rpoK	era	ompH	fabZ	yidC	truA	kpsF/gutQ	spoU	dnaB	mscS	radA	gntR	sua5	holC	degP	hrpB	hsp33	recX	tatAE	ribF	mutL	ompA	comA	ahpD	xerD	secB	tim44	mutS2	hicA	hicB	arsR	hap	nol1/nop2	dnaA	cycH	cycL	dsbE	cysK	cycJ	ccmC	rmuC	kdpE	rpoB	rpoC	secY	rpoA	moxR	emrB/qacA	greA	gatB	rpoD	yjgPQ	minE	minD	minC	carD	ftsX	ftsE	hslV	hslU	sufE	prmA	uvrD/rep	radC	secF	secD	smc	phoH	miaB	uvrC	MoeD	MoeE	lexA	moeA	maf	thiO	thiS	recG	cobT	cobS	bolA	hemK	dnaC	tatD	osmC	exsB/queC	smpB	purS	terC	moaC	secG	recF	tolC	nusB	deoR	recQ	rpoE	phnA	apbA	phzCF	xdhC	xdhB	xdhA	marR/hxlR	czcA	hlyD	emrE	asmA	fecR/pupR	copG	rrf5	rrl5	rrs5	cbiD	cbiG	cobH	cobG	fis	fdhD	ftsJ	sirA	hlyA	oppC	oppB	oppD	oppF	npd	padR	fcuA	cydC	cydD	mgtC	czcB/hlyD	czcC	rrf4	rrl4	rrs4	iclR	lolD	lolCE	baf	hu	lon	clpX	clpP	hesB	sufS	sufD	sufC	sufB	hlyB/msbA	mutS	pmbA/tldD	asnC	mucR	mviN	plsX	smpA/omlA	envZ	ompR	rimM	ffh/srp54	atp12	engA	sod	clpB	hflX	hfq	ntrX	ntrY	ntrC	ntrB	nifR3	ispDF	ubiHF	groES	groEL	fkbH	amtB	xerC	yajC	apaG	pfpI	ribD	obgE/cgtA	mraZ	mraW	ftsW	ftsQ	ftsA	ftsZ	recN	copA	copB	recO	copZ	lytB	pirin	glgX	pcpB	sco1/senC	rrf3	rrl3	rrs3	dsbD	tatC	tatB	scpB	scpA	pqqA	pqqB	pqqC	pqqD	pqqE	aatC	recJ	rpoH	fecE	secB-2	pdxJ	cobP	cobD/cbiB	cobQ	cobOP	cbiA	cobN	ycdH	ureG	ureF	ureE	ureD	trbI	trbG	trbF	trbL	repA	rrf2	rrl2	rrs2	xdhC/coxI	truB	cvpA	ftsY	phoU	pstB	pstA	pstC	apbC	nusA	ccmA	ccmB	thiC	ftr1	mutY	rrf1	rrl1	rrs1	ompW	aadR/crp/fnr	htrA	ssb	traR/dksA	tldD	ftsK	nifS	folC	ctrA	ftsH	tilS	tolB	ruvB	ruvA	ruvC	secA	argJ	hisA	hisF	parB	parA	gidB	gidA	trmE/thdF
NC_021150	dnaA	recF	gyrB	hdtS	glyS	glyQ	trkA	sun	fmt	def	lysM	dprA	yrdC	qor	aroE	pcoA	pcoB	cysC2	uvrD	polA	thrB	gadh3	gadh1	gadh2	cyoE	ctaD	purE	purK	modA2	nifH	nifD	nifK	nifT	nifY	lrv	nifE	nifN	nifX	fesI	iscAnif	nifU	nifS	nifV	cysE1nif	nifW	nifZ	nifM	clpX	nifF	osmC	pagL	dsbA	cycA	engB	zwf-4	hexR-3	pycA	pycB	trpI	trpB	trpA	cbpA	clpP	aglA-2	vnfY	vnfK	vnfG	vnfD	vnfF	vnfH	pcaK	moeB2	vnfX	vnfN	vnfE	vnfA	vnfU	relA	rpoZ	gmk	rph	crc	pyrE	argB	algC	dut	coaBC	radC	mtgA	thiG	trmB	rdgB	metW	metX	proC	pilT	pyrC	pyrB	pyrR	gshB	pilG	bioA	ilvD	hipA	coaD	metF	ahcY	ligB	fucP	dctM	dctQ	dctP	dctD2	dctB2	fdhD	fdhA	fdhH	fdhI	fdhE	moaA	mqo	glk	gluP	hoxY	hoxH	hoxW	cooJ	cooT	cooC	cooS	cooF	yggX	mutY	hisB	hisH	hisA	hisF	pgm-1	secB	argE	argA	tonB	gshA	bcsZ	bcsD	bcsC	bcsAB	ompR	envZ	hslO	pckA	cls	nudE	metK	tktA-3	epd	pgk	fba	cccA	katG	glcB	dsbG	bioB	bioF	bioH	bioC	birA	coaX	tuf	secE	nusG	rplK	rplA	rplJ	rplL	rpoB	rpoC	rpsL	rpsG	fusA	rpsJ	rplC	rplD	rplW	rplB	rpsS	rplV	rpsC	rplP	rpmC	rpsQ	rplN	rplX	rplE	rpsN	rplR	rpsE	rpmD	rplO	secY	rpmJ	rpsM	rpsD	rpoA	rplQ	uvrA	oprG-2	oprG-1	nrdR	ribD	ribE	ribB	ribH	nusB	thiL	ribA	retS	purD	purH	fis	prmA	accC	accB	aroQ	dipZ	speA	serB	psd	rhdA	mutL	miaA	hfq	hflX	hflK	hflC	hisZ	purA	rnr	rpsF	rpsR	rplI	dnaB	fpr2	tesB	trmA	bcpB	btuB	dxs-1	ispA-1	xseB-1	ppa	aldA	mpl	ubiX	tonB2	ftsH	lrgA	lrgB	proA	nadD	rodA	mltB	dacA	lipB	lipA	benR	benA	benB	benC	benD	xylJ	xylI	xylH	xylQ	xylT	xylE	xylG	xylK	dmpI	dmpB	dmpP	dmpO	dmpN	dmpM	dpmL	dmpK	holA	leuS	lnt	miaB	rhaU	hpaI	aldB	hemL	thiE	thiD	ureG	ureE	dctA-2	dsbB	leuD2	leuC2	ureC	ureB	ureA	eutB	eutC	rluE	hrpB	cueR	selD	folX	mdcA	citG	mdcC	mdcD	mdcG	madL	madM	nhaA	fixA	fixB	fixC	fixX	tonB3	algA	algF	algV	algI	algL	algX	algG	algJ	algK	alg44	alg8	algD	algB	coxB	coxA	moaC	moaD	moaE	rhlB	estB	apc4	apc3	apc2	apc1	valS	pepA	lptF	lptG	mviN	ribF	ileS	lspA	ispH	fimT	pilE	pilV	pilW	pilY1	thiO	comL	rluD	clpB	ndh	coaE	pilD	pilC	pilB	pilA	nadC	ampD	ampE	fruR	fruB	fruK	fruA	prfC	speC	lldD	gatB	gatA	gatC	mreB	mreC	mreD	maf	tldD	tldE	ptsO	ptsN	rpoN	lptB	lptA	kpsF	murA	hisG	hisD	hisC	algW	cysD	cysN	trpS	nfi	rplM	rpsI	petA	petB	petC	sspA	sspB	gmhA	mraZ	mraW	ftsL	ftsI	murE	murF	mraY	murD	ftsW	murG	murC	ddl	ftsQ	ftsA	ftsZ	lpxC	secA	argJ	apbA	ampG	fabG	pdxH	fxsA	groES	groEL	rsuA	ung	nadB	mucA	mucB	mucC	mucD	lepA	lepB	rnc	era	alyA3	recO	pdxJ	mmsA	lon	putA	metR	metE	argF	bfr	rnt	pyrC2	argG	mobA	moaB	moeA	ech	vdh	fcs	aat1	etfB1	eftA1	pobA	pobR	fadB	fadA	topA	lexA	psrA	mfd	gapB	nqrA	nqrB	nqrC	nqrD	nqrE	nqrF	apbE	sthA	lolC	lolD	lolE	comA	exbB	exbD	lpxK	kdsB	ptpA	murB	rne	rluC	rpmF	plsX	fabD	fabG2	acpP	fabF	pabC	tmk	holB	pilZ	mhpB	mphC	mhpR	mhpD	mhpF	mphE	mhpT	etfB2	eftA2	pyrF	efp	ohr	hmgA	htpX	fadE	queF	trkH	ispZ	scpA	scpB	edd-1	glk2	hexR-2	pgl-1	eda-2	gph	ubiG	mtnA	gyrA	serC	pheA	hisC2	aroA	cmk	rpsA	ihfB	rfaH	rfbB	rfbD	rfbA	rfbC	gspD	gspG2	gspE	cysC1	eexD	eexE	eexF	hrpA	fadB2	etfB3	etfA3	zwf-3	pgi	zwf-2	pgl-2	pgi2	colS	colR	ppdK	dctB	dctD	pepN	pyrD	rlmL	dacB	htpG	glxI	rnfE	rnfG	rnfD	rnfC	rnfB	rnfA	metG	apbC	dcd	dinG	agaA	estC	pdxH2	shaAB	shaC	shaD	shaF	shaG	livG	livM	dnaX	recR	cydB	cydA	cydR	hemN	ccoS	ccoI	ccoH	ccoG	ccoP	ccoQ	ccoO	ccoN	acnA	tusA	tyrB	uvrB	gltX	thrS	infC	rpmI	rplT	pheS	pheT	ihfA	dxs-2	ispA-2	xseB-2	idi	acxA	acxB	acxC	acxR	rhdE	cysE3	alc	entA	entF	entB	entE	csbC	csbX	vbiC	metC	ssuD	tauD	nemA	asfC	asfB	asfA	torG	rpiB	talB-1	tktA-1	rpiA-1	pykA-2	eno-2	xdhC	xdhB	xdhA	dszA	gntT	gntK-1	feoA	feoB	feoC	exeA	greB	oprI	aroG	cysB	cysH	pabB	gntR	prpB	prpC	acnD	prpF	ppsA	rraA	oprF	nasH	nasB	nasA	narK	atsA	pfkB	nasT	nasS	acnB	lpxH	glnS	cysS	folD	tig	clpP2	clpX2	pepS16	ppiD	phbC	
phbA	phbB	phbR	phbP	phbF	folE	speE	alyA2	flgN	flgM	flgA	flgB	flgC	flgD	flgE	flgF	flgG	flgH	flgI	flgJ	flgK	flgL	fliR	fliQ	fliP	fliO	fliN	fliM	fliL	fliK	fliJ	fliI	fliH	fliG	fliF	fliE	fliD	fliS	fliT	flhX	flhD	flhC	motA	motB	cheA	cheW	cheR	kdpD	kdpE	kup	kdsA	glgX	treY	malQ	treZ	glgA	galU	gor	gacA	uvrC	pgsA	actP	fpvI	fpvR	groEL2	gapA	gcvP2	sdaA	gcvT3	hutG	hutI	hutU	phyH	hutH	hutF	nifA2	lysE	pvdE	ptxS	kguE	kguK	kguT	kguD	fagA	fumC	eno-1	pykA-1	eda-1	zwf-1	hexR-1	edd-2	gntK-2	mtlY	tktA-2	talB-2	mtlD	mtlK	mtlG	mtlF	mtlE	mtlR	pgm-2	ybhE	cysI	nudC	motD	motC	fliC	rfbC2	rfbG	rfbF	hicB	flaG	fliA	flhE	flhF	flhA	flhB	cheZ	cheY	cheB	glgB	treS	glgE	cobA	serS	rarA	lolA	ftsK	trxB	aat	infA	clpA	clpS	cspD	icd	mnmA	purB	nuoA	nuoB	nuocd	nuoE	nuoF	nuoG	nuoH	nuoI	nuoJ	nuoK	nuoL	nuoM	nuoN	gspG1	gspF	nfuA	xth	ppiC	sixA	gpsA	fabA	fabB	metH	arsD	arsC	arsB	arsA	dnaQ	rnhA	fabI	sucC	lpdA	sucB	sucA	sdhB	sdhA	sdhD	sdhC	gltA	wbpO	wbpP	phaJ	codA	hepA	ligA	zipA	smc	moaA2	cycH	ccmH	ccmG	ccmF	ccmE	ccmD	ccmC	ccmB	ccmA	fleN	sodA	ssuB	ssuC	ssuA	ssuE	cysP	lapQ	lapG	lapF	lapI	lapH	lapE	lapC	arsE	lapP	lapO	lapN	lapM	lapL	lapK	lapB	lapR	ttdB	ttdA	dctA-3	dszA2	cas2	cas1	cas4	csd2	csd1	cas5d	oprE	cysT	cysW	cysA	alyA1	gcvT2	hpaR	talB-3	cyaB	recQ	cysK	rnd	cobS	cobC	cobT	cobU	cobQ	cobD	cbiA	btuB2	aroF	dctA-1	pcpS	nrdA	nrdB	vnfA2	algE5	mexT	mexE	mexF	ppnK	metZ	purF	folC	accD	trpF	truA	asd	leuB	leuD	leuC	dusC	fleQ	algZ	mgtE	csrA	lysC	alaS	argD	acsA1	pta-1	ackA-1	phhA	phhB	ggpS	hppD	fumC2	mnmC	asnB	actP2	htrB	minC	minD	minE	rluA	rdgC	purC	dapA	nadA	frdC	trbG	trbF	trbL	trbJ	trbE	trbD	trbC	trbB	traF	oprL	tolB	tolA	tolR	tolQ	ybgC	ruvB	ruvA	ruvC	aspS	proS	pgsA2	purM	purN	relB	relE1	mazG	ilvM	rumA	cysM	gacS	dinP	tmp	lysB	rimO	pfpI	cls2	sodB	araJ	aroQ2	pip	hppD2	pcaC	pcaD	pcaB	pcaF	pcaJ	pcaI	pcaR	pcaG	pcaH	pcaQ	purU	mvaT	sbcB	pykA-6	fumB	fpr1	finR	recX	recA	ibpB	ibpA	mutS	fdxA	aerP	rpoS	nlpD	pcm	surE	truD	ispF	fghA	ispD	ftsB	eno	kdsA2	pyrG	metJ	accA	dnaE	rnhB	lpxB	lpxA	fabZ	lpxD	ompH	mucP	dxr	csdA	uppS	frr	pyrH	tsf	rpsB	map	glnD	dapC	dapD	thiF	dapE	rrmA	cspA	yeaZ	adk	ppc	lysS	prfB	recJ	yaeQ	thrC	hom	dsbC	xerD	rplS	trmD	rimM	rpsP	ffh	purT	purL	guaA	xseA	acoK	gcd	pvdH	fpvB	engA	hisS	ispG	pilF	ndk	iscX	fdx	hscA	hscB	iscA	iscU	iscS	iscR	cysE2	trmH	suhB	secF	secD	yajC	tgt	queA	rpsT	proB	obgE	rpmA	rplU	gerC	fklB	fpvA	glyA	cobW	cstA	radA	mscL	ackA-2	pta-2	mqo2	hpt	upp	hemH	phr	murI	prfA	hemA	lolB	ipk	pth	ychF	pqqF	pqqE	pqqD	pqqC	pqqB	acoR	acoA	acoB	acoC	adh	otsA	otsB	rimI	leuA	pqiA	pssA	ilvC	ilvH	ilvI	mrcB	hmuV	sfsA	aspT	dksA	cbrA	cbrB	pcnB	folK	panB	panC	panD	pgi3	acsA2	pnp	rpsO	truB	rbfA	infB	nusA	secG	glmM	folP	ftsH1	rrmJ	greA	carB	carA	dapB	dnaJ	dnaK	grpE	recN	fur	omlA	smpB	lctP	mosA	mosB	maeB2	lldp	glcB2	glcG	glcF	glcE	glcD	glcC	eno-3	pykA-3	cdaR	yhaD	gcl	garR	pirA	parC	parE	sfnG	seuB	ribAB	livH	seuA	metB	msuD	sfnR	pta-3	ackA-3	thiC	rfaE	msbA	galE	waaP	waaG	ilvE	glnE	aceE	aceF	msrA	relE2	hemE	gltD	gltB	aroB	aroK	pilQ	pilP	pilO	pilN	pilM	ponA	maeB	rpmE	priA	argS	hslV	hslU	ubiE	ubiB	hisI	hisE	tatA	tatB	tatC	mdoH2	mdoD	mdoH1	mdoG	dtd	fbp	glpF	glpK	glpR	glpD	eno-4	tpiA-2	typA	thiI	glnA	ntrB	ntrC	ahpC	bioD	tyrS	anmK	erpA	argC	coq7	speD	crp	trpC	trpD	trpGD	pdhB	pdhA	cadR	pncA	mdaB	ada	hpaI2	trpE	gph2	rpe	gabT	lptD	pdxA	ksgA	apaG	apaH	glpE	prkA	cca	folK1	folB	gcp	rpsU	dnaG	rpoD	cooA	katN	vnfA3	fecA	gcvP1	gcvH	gcvT	ubiF	ubiH	ubiD	rho	trxA	ppx	ppk	hemB	feoB2	algQ	dsbB2	hemD	hemC	algR	argH	corA	cyaY	lppL	lysA	dapF	xerC	amtB	glnK	pirR	rep	xpt	cycB	dadX	dadA	lrp	aldH	rpmG	rpmB	rpoH	ftsX	ftsE	ftsY	glpQ	thyA	lgt	ptsP	nudH	ilvA	rpiA-2	serA	recG	muc26	hupA	ubiC	ubiA	
phoB	phoR	phoU	pstB	pstA	pstC	pstS	potI	potH	potG	potF	spuC	spuB	spuA	anfR	anfO	anfK	anfG	anfD	anfH	anfA	anfU	gabD	algY	tctE	tctD	tctC	tctB	tctA	fic	tpiA-3	tpiA-1	eno-5	pykA-5	iolC	iolE	iolB	iolT	iolA	iolD	idh	iolH	hppD3	fahA	maiA	nafY	rnfH	rnfE1	rnfG1	rnfD1	rnfC1	rnfB1	rnfA1	nifL	nifA	nifB	nifO	nifQ	norR	hmp	algE3	algE2	algE1	algE4	algE6	algE7	dgoT	galD	dgoA	dgoK	dgoR	pykA-4	melA	aglA-1	scrY-2	hbdH	bktA	lpxO	recD	recB	recC	nhaA2	scrY-1	lacY	scrB	scrR	glmS	glmR	glmU	atpC	atpD	atpG	atpA	atpH	atpF	atpE	atpB	atpI	parB	parA	gidB	gidA	trmE	rnpA	rpmH
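The same awk pattern generalises to any qualifier that fits on a single line. For example (my sketch, not part of the original list), /product="*" can be tabulated in exactly the same way; note that qualifiers wrapped over several lines in the GBK file will not be captured cleanly:
cat *.gbk | awk '/\/product=/{print gensub(" +/product=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' > table.tsv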

Identifying duplicates in two FASTA files (awk)

Say we have two FASTA files that contain some duplicate sequences (exact match) but with different IDs.
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTAGTAGTC
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCGAGTAGTC
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC
[uzi@quince-srv2 ~/oneliners$ cat f2.fasta
>C1_F4
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTTGTAGTC
>C2_F5
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTC
>C3
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCTAGTAGTC
>C4
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
AGCGTGGGGAGCAAACAGGATTAGATACCCGGGTAGTC
>C5
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
TAGGGGAGCGAATGGGATTAGATACCCTGGTAGTC
We can use the following one-liner to list the unique sequences, one per line, with the IDs of duplicates given as a comma-separated list:
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }'
F3
F4_C1,C1_F4
C4
C3
F2
C5
F5_C2,C2_F5
F1
In the above one-liner, the first awk statement linearizes the FASTA file so that header and sequence alternate on separate lines (a sketch of this step on its own is shown after the output below). To identify duplicates only, just add an extra awk statement at the end:
[uzi@quince-srv2 ~/oneliners$ cat f1.fasta f2.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | awk 'NR>1{ if(NR%2==0){gsub(">","",$1);h=$1} else {a[$1]++;b[$1]=b[$1]","h}} END { for (n in a) {gsub("^,","",b[n]);print b[n]} }' | awk -F, 'NF>1'
F4_C1,C1_F4
F5_C2,C2_F5
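To see what that linearization step does on its own (shown here just for illustration, not on the original page), run only the first awk statement:
cat f1.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' | head -4
# prints a blank line, then >F1, then the full F1 sequence on one line, then >F2;
# the leading blank line is why the second awk statement starts counting at NR>1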

Converting "Sample[TAB]Feature[TAB]Abundance" list to tab-delimited abundance table

Say the data is in the following format:
[uzi@quince-srv2 ~/oneliners$ cat test.txt
SAMPLE	MIR	ABUNDANCE
sample1	mir1	30
sample1	mir3	100
sample1	mir4	120
sample2	mir1	40
sample2	mir2	200
sample3	mir1	190
sample3	mir1	400
sample4	mir4	20
sample5	mir1	19
You can use the following one-liner to convert the above data into a tab-delimited abundance table:
[uzi@quince-srv2 ~/oneliners$ perl -ane 'if ($. > 1){$r{$F[0].":".$F[1]}=$F[2];unless($F[0]~~@s){push @s,$F[0];}unless($F[1]~~@m){push @m,$F[1];}}END{print "Samples\t".join("\t",@s)."\n";for($i=0;$i<@m;$i++){print $m[$i];for($j=0;$j<@s;$j++){(not defined $r{$s[$j].":".$m[$i]})?print "\t".0:print"\t".$r{$s[$j].":".$m[$i]};}print "\n";}}' test.txt
Samples sample1 sample2 sample3 sample4 sample5
mir1 	30 	40 	400 	0 	19
mir3 	100 	0 	0 	0 	0
mir4 	120 	0 	0 	20 	0
mir2 	0 	200 	0 	0 	0
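The ~~ smartmatch operator used above is experimental and has been deprecated in recent Perl releases, so here is a hash-based variant of the same idea (my sketch, not from the original page) that avoids it:
perl -ane 'if($.>1){$r{"$F[0]:$F[1]"}=$F[2]; push @s,$F[0] unless $ss{$F[0]}++; push @m,$F[1] unless $sm{$F[1]}++} END{print join("\t","Samples",@s),"\n"; for $f (@m){print join("\t",$f, map { $r{"$_:$f"} // 0 } @s),"\n"}}' test.txt
# %ss and %sm remember which samples/features have been seen, replacing the ~~ membership tests;
# missing sample/feature combinations default to 0 via the defined-or operator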

Dereplicating Reads

We can also dereplicate the reads from FASTQ files and generate, as a FASTA file, the unique reads sorted by abundance with usearch-style size annotations in the headers:
Single-end FASTQ file
[uzi@quince-srv2 ~/oneliners$ perl -MDigest::MD5=md5_hex -ne 'push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){chomp($a[1]);$d{uc($a[1])}++;} END { foreach my $n (sort{ $d{$b} <=> $d{$a} } keys %d){print ">".md5_hex($n).";size=".$d{$n}.";\n".$n."\n"}}' forward.fastq > forward_derep.fasta
[uzi@quince-srv2 ~/oneliners$ head forward_derep.fasta
>55961afb4a0df6979393aa7f2ad02cac;size=340;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>c5a3043fda41c9b4f972826c3bba8154;size=263;
CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>12071fab7db0b36abcc540059f9bb78b;size=119;
CCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCGGAATTACTGGGCGTAAAGCGTACGTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCAG
>76d3994572a1c37336f2b4dd897434a7;size=107;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGAGAGCCTGAACCAGCCAAGTCGCGTGAAGGAAGACTGTCCTAAGGATTGTAAACTTCTTTTATACGGGAATAACGGGCGATACGAGTATTGCATTGAATGTACCGTAAGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGTGCGTAGGTTGTTCGGTA
>c6e34aa95f6b068fcd12e3308e92815a;size=104;
CCTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGGCGCAGGCCTGAACCAGCCAAGTAGCGTGAAGGATGACTGCCCTATGGGTTGTAAACTTCTTTTATATGGGAATAAAGTTTTCCACGTGTGGAATTTTGTATGTACCATATGAATAAGGATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATCCGAGCGTTATCCGGATTTATTGGGTTTAAAGGGAGCGTAGGTGGACAGTTA
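As a quick sanity check (mine, not from the original page), the size= annotations should add up to the total number of reads in the input FASTQ file (10000 here):
grep -c '^>' forward_derep.fasta	# number of unique (dereplicated) sequences
grep -o 'size=[0-9]*' forward_derep.fasta | awk -F= '{s+=$2} END{print s}'	# sums the size= fields; should equal the read count of forward.fastq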
Paired-end FASTQ files
[uzi@quince-srv2 ~/oneliners$ paste forward.fastq reverse.fastq | perl -MDigest::MD5=md5_hex -ne 'chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){@q=split("\t",$a[1]);$d{uc($q[0].":".$q[1])}++;} END {open(FF,">","forward_derep.fasta");open(RR,">","reverse_derep.fasta");foreach my $n (sort{ $d{$b} <=> $d{$a} } keys %d){$l=md5_hex($n);@r=split(":",$n);print FF ">".$l." 1;size=".$d{$n}.";\n".$r[0]."\n"; print RR ">".$l." 2;size=".$d{$n}.";\n".$r[1]."\n"; }}'
[uzi@quince-srv2 ~/oneliners$ head *_derep.fasta
==> forward_derep.fasta <==
>948b3e6c0d9c5140a20b3e950baf513b 1;size=45;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>84ee339f8f7d47fb5278cc7d211c2856 1;size=32;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>087be525fc7401eb8dc965c0b0b20a83 1;size=30;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>5046bcb4cbabfb76f0f6966c0ada2695 1;size=30;
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
>792f79536de130a5e6ded001bbe8a191 1;size=28;
CCTATGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGTAGGGAGGAAAGGTAGCAGCTTAATACGCTGTTGCTGTGACGTTACCTACAGAAGAAGGACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTCCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT

==> reverse_derep.fasta <==
>948b3e6c0d9c5140a20b3e950baf513b 2;size=45;
GACTACCAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>84ee339f8f7d47fb5278cc7d211c2856 2;size=32;
GACTACCCGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>087be525fc7401eb8dc965c0b0b20a83 2;size=30;
GACTACAAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>5046bcb4cbabfb76f0f6966c0ada2695 2;size=30;
GACTACCGGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG
>792f79536de130a5e6ded001bbe8a191 2;size=28;
GACTACTAGGGTATCTAATCCTGTTTGCTCCCCACGCTTTCGTGCCTGAGCGTCAGTCTTTGTCCAGGGGGCCGCCTTCGCCACCGGTATTCCTCCAGATCTCTACGCATTTCACCGCTACACCTGGAATTCTACCCCCCTCTACAAGACTCTAGTTCGCCAGTTCGAAATGCAATTCCCAGGTTGAGCCCGGGGCTTTCACATCTCGCTTAACAAACCGCCTGCGCACGCTTTACGCCCAGTAATTCCG

Paired-end assembler

Here, we are trying to emulate the functionality of pandaseq using EMBOSS water (an implementation of the Smith-Waterman local alignment algorithm):
[uzi@quince-srv2 ~/oneliners$ paste forward.fastq reverse.fastq | perl -ne 'chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){@q=split("\t",$a[1]); $q[1]=join("",map{$_ =~ tr/ACGT/TGCA/; $_} reverse split("",$q[1])) ;$r=qx/water -asequence=asis:$q[0] -bsequence=asis:$q[1] -gapopen=50 -gapextend=0.5 -stdout -auto -aformat3 markx1/;$r=~/\s*asis (.*)\n\s*asis (.*)/;if ($1 ne ""){@h=split(" ",$a[0]);$m_f=$1;$m_r=$2;$q[0]=~/(.*)$m_f/;$r_l=$1;$q[1]=~/$m_r(.*)/;$r_r=$1;print ">".substr($h[0],1)."\n".$r_l.$m_f.$r_r."\n"}}' > merged.fasta
This one-liner is useful for assembling paired-end reads from 16S rRNA amplicons where the primers are designed in such a manner that there is a substantial overlap between the paired-end reads. The one-liner could be improved further: I am using a very high gap-open penalty, which gives me only one local alignment spanning a single alignment record in the output file generated by water. If I make it smaller, my regular expressions will fail because the alignments output by water will span multiple lines (I will update this later). Moreover, I am not taking the consensus of the overlapped region, but rather keeping the region from the forward read, on the view that reverse reads generally have lower quality than forward reads.
For comparative purposes, I have generated merged_pandaseq.fasta from pandaseq:
pandaseq -f forward.fastq -r reverse.fastq -o 10 > merged_pandaseq.fasta	
where I have used -o 10, -o 20, and -o 50 in turn. I then used bowtie2 to map the reads against the reference database:
bowtie2-build ref.fasta ref	
bowtie2 -x ref -f -U reads.fasta > reads.sam	
and generated the following table:
Algorithm               Perl one-liner (gapopen=50,gapextend=0.5)  pandaseq (o=10)  pandaseq (o=20)  pandaseq (o=50)
Total Reads             10000                                      10000            10000            10000
Overlapped Reads        7197                                       9753             9592             3440
No Alignment            34 (0.47%)                                 314 (3.22%)      273 (2.85%)      23 (0.67%)
Aligned exactly 1 time  684 (9.50%)                                920 (9.43%)      903 (9.41%)      263 (7.65%)
Aligned >1 times        6479 (90.02%)                              8519 (87.35%)    8416 (87.74%)    3154 (91.69%)
Overall alignment rate  99.53%                                     96.78%           97.15%           99.33%
% Overlapped reads      71.97%                                     97.53%           95.92%           34.4%

It can be seen that my one-liner hasn't done a bad job: I am getting around 72% overlapped reads, of which 99.53% align to the reference. pandaseq, on the other hand, is sensitive to the minimum overlap parameter.

Generating abundance tables and trees from CREST and RDP classifiers

These taxonomic analysis tools are frequently used in metagenomic analyses to generate taxonomic assignments for small-subunit ribosomal RNA (SSU rRNA) sequences. They rely on lowest common ancestor (LCA) based taxonomic classification: sequences are searched against public databases, assigned to the most similar taxa in a given taxonomy, and abundance tables are then generated.
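The LCA idea itself is straightforward: given the lineages of a read's best database hits, the read is assigned to the deepest rank on which all of those lineages agree. A small hypothetical Python sketch of that longest-common-prefix logic (the actual classifiers add bit-score filtering and other heuristics on top):

def lca(lineages):
    # Lowest common ancestor of a list of lineages (each a list of taxa, root first)
    common = []
    for ranks in zip(*lineages):      # walk the lineages rank by rank
        if len(set(ranks)) == 1:      # all hits agree at this rank
            common.append(ranks[0])
        else:
            break                     # first disagreement: stop here
    return common

hits = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Enterobacterales"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Pseudomonadales"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria", "Rhizobiales"],
]
print(";".join(lca(hits)))            # Bacteria;Proteobacteria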

Abundance table from CREST: The following two commands will produce an assignment file named input_file_Assignments.txt for input_file.fa:
$ megablast -i input_file.fa -b 50 -v 50 -m 7 -d $PATH_TO_CREST_DB/silvamod.fasta -a 10 -o input_file.xml
$ classify -i input_file.fa -a -p input_file.xml	
To extract an abundance table and the corresponding tree from CREST, the data should first be organized to follow the folder structure shown below.

This organisation makes it easier to run repetitive functionality on the subfolders, each containing an assignment file for one sample, with the sample name as the folder name (this is what will appear in the abundance table). Moreover, with this structure we can use the collateResults.pl utility, which searches subfolders for files with names matching a certain pattern, extracts the quantitative information stored in them in a delimited format, and then merges the information from all of the subfolders together. The following bash one-liner will produce an abundance table (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:
[uzi@quince-srv2 ~/oneliners/CREST/Main/S1]$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk
The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:
[uzi@quince-srv2 ~/oneliners/CREST/Main]$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0=$2";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
[uzi@quince-srv2 ~/oneliners/CREST/Main]$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv
The above bash one-liner generates a test.csv file in each subfolder, which is then used in the second command to collate the results together to produce collated.csv.
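For reference, the collation performed by collateResults.pl is conceptually simple; a hypothetical Python sketch that merges the per-sample test.csv files (first row "Sample,<name>", remaining rows "taxon,count", as produced by the awk one-liner above) into a single table might look like the following (the exact output layout of collateResults.pl may differ):

import csv, glob, os
from collections import defaultdict

table = defaultdict(dict)                      # table[taxon][sample] = count
samples = []

# Each subfolder (S1, S2, ...) is expected to contain a test.csv
for path in sorted(glob.glob(os.path.join(".", "*", "test.csv"))):
    with open(path) as fh:
        rows = list(csv.reader(fh))
    sample = rows[0][1]                        # "Sample,<name>" header row
    samples.append(sample)
    for taxon, count in rows[1:]:
        table[taxon][sample] = count

# Write a merged abundance table with one column per sample
with open("collated.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Taxa"] + samples)
    for taxon in sorted(table):
        writer.writerow([taxon] + [table[taxon].get(s, "0") for s in samples])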

Abundance table from RDP: The following command will produce an output file named input_file_Assignments.txt for input_file.fa:
$ java -Xmx1g -jar classifier.jar classify -f filterbyconf -o input_file_Assignments.txt input_file.fa
We follow the same folder structure as mentioned previously. The bash one-liners for RDP are slightly different due to the different format of the assignment files, however they follow the same philosophy. The following bash one-liner will produce an abundance table (test.csv) and the corresponding tree in the Newick format (test.nwk) at Phylum level for a single sample when run inside the S1 folder on the terminal:
[uzi@quince-srv2 ~/oneliners/RDP/Main/S1]$ cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;print "Sample,S1" > k ;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}' | java -classpath ~/bin Path2Newick > test.nwk
The following set of commands will produce an abundance table (collated.csv) and the corresponding tree in the Newick format (collated.nwk) at Phylum level for multiple samples when run inside the Main folder on the terminal:
[uzi@quince-srv2 ~/oneliners/RDP/Main]$ for i in $(ls -d */); do i=${i%%/}; cd $i; cat *_Assignments.txt | awk -F"\t" -v k="test.csv" -v l=3 '{a=match($1,"size=.*;");if(a==0) c=1; else c=substr($1,RSTART+5,RLENGTH-6);$0="Cellular_organisms;"$3";"$6";"$9";"$12";"$15";"$18";";gsub("\\(","[",$0);gsub("\\)","]",$0);gsub(" ","_",$0);split($0,ss,";");$0="";for (i=1;i<=l;i++) $0=$0";"ss[i];gsub("^;","",$0);$0=$0";"; gsub(".*;;$","Cellular_organisms;__Unknown__;",$0);gsub("\"","",$0);b[$0]+=c} END {for (i in b) print i;for (i in b) print gensub(".*;(.*?);","\\1","g",i)","b[i] >> k}'; cd ..; done | java -classpath ~/bin Path2Newick > collated.nwk
[uzi@quince-srv2 ~/oneliners/RDP/Main]$ perl ~/bin/collateResults.pl -f . -p test.csv > collated.csv
For producing trees at other taxonomic levels, in the first awk statement in the above bash one-liners, use: l=4 for Class, l=5 for Order, l=6 for Family, and l=7 for Genus, respectively. These bash one-liners will work even if we run the classifiers on the dereplicated reads in the usearch header format i.e. containing the string size=.* in the FASTA headers. If the original dereplicated sequences are in a different format, we first convert them to the usearch header format before running the classifiers and using the above bash one-liners. For example, if we have generated the denoised and dereplicated reads from AmpliconNoise, then the following bash one-liner can be used to convert the FASTA headers:
$ awk '/>/{$0=gensub("(.*?)_(.*?)$","\\1;size=\\2;","g",$0)}1' dereplicated.fa > dereplicated_usearch_format.fa
Note: Delete the test.csv file from the individual folders every time you run the bash one-liners for multiple samples (say, for a different value of l)
These bash one-liners rely on the Path2Newick.java utility to convert the extracted paths to a tree in the Newick format. Both the abundance table and the tree can then be used in PHYLOmap.py to generate a heatmap. The working of Path2Newick and an example heatmap generated from PHYLOmap are given below.
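For completeness, the core of what Path2Newick does - turning the semicolon-delimited paths into a Newick string - can be sketched in Python as follows (illustrative only; the Java utility's exact conventions may differ):

def paths_to_newick(paths):
    # Build a nested dict from the paths, then serialise it in Newick form
    tree = {}
    for path in paths:
        node = tree
        for taxon in [t for t in path.strip().split(";") if t]:
            node = node.setdefault(taxon, {})

    def to_newick(node, name):
        if not node:                            # leaf node
            return name
        children = ",".join(to_newick(c, n) for n, c in node.items())
        return "(" + children + ")" + name

    root_name, root = next(iter(tree.items())) # assumes a single root (Cellular_organisms)
    return to_newick(root, root_name) + ";"

paths = [
    "Cellular_organisms;Bacteria;Proteobacteria;",
    "Cellular_organisms;Bacteria;Firmicutes;",
    "Cellular_organisms;__Unknown__;",
]
print(paths_to_newick(paths))
# ((Proteobacteria,Firmicutes)Bacteria,__Unknown__)Cellular_organisms;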
 

Spatial comparison of read qualities between different sequencing runs (Illumina paired-end reads)

Step 1: We first get the average quality score and flow cell information for each read and store them as DAT files, i.e., redirect the output to FILE.dat as in the following example
perl -ne 'chomp;$_=~m/@(.*?):(.*?):(.*?):(.*?):(.*?):(.*?):(.*?)\s/;print $4.",",$5.",".$6.",".$7;<>;<>;$_=<>;chomp;map{$s+=ord($_)-33}split("");print",".($s/length($_))."\n";$s=0;' < FILE.fastq > FILE.dat	
This one-liner gives:

Column 1: Flow cell lane
Column 2: Tile number within the flowcell lane
Column 3: x-coordinate of the cluster within the tile
Column 4: y-coordinate of the cluster within the tile
Column 5: Average quality score

For example,
	
[uzi@quince-srv2 ~/oneliners]$ perl -ne 'chomp;$_=~m/@(.*?):(.*?):(.*?):(.*?):(.*?):(.*?):(.*?)\s/;print $4.",",$5.",".$6.",".$7;<>;<>;$_=<>;chomp;map{$s+=ord($_)-33}split("");print",".($s/length($_))."\n";$s=0;' < 5_TAAGGCGA-TAGATCGC_L001_R2_001.fastq | head -10
1,1101,9593,26322,26.4814814814815
1,1101,7952,34132,27.6470588235294
1,1101,8189,100645,26.3333333333333
1,1102,10633,88138,25.8048780487805
1,1103,5153,25465,36.7029702970297
1,1103,9533,58452,27.4752475247525
1,1103,21248,64336,28.1842105263158
1,1104,19815,7437,23.0769230769231
1,1105,14548,7890,30.4653465346535
1,1105,8795,73701,21.3076923076923
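The average quality in column 5 is simply the mean of the Phred+33 encoded characters on the FASTQ quality line, i.e. ord(char) - 33 averaged over the read, exactly as in the perl above; a minimal Python equivalent of that part of the one-liner:

def mean_phred(qual_line, offset=33):
    # Mean Phred quality of a FASTQ quality string (Sanger/Illumina 1.8+ encoding)
    scores = [ord(c) - offset for c in qual_line]
    return sum(scores) / len(scores)

print(mean_phred(";<;<"))   # ';' is Phred 26 and '<' is Phred 27, so the mean is 26.5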
Step 2: Having generated the individual dat files (*_R1.dat and *_R2.dat), collate the forward reads and reverse reads separately to generate R1.dat and R2.dat, respectively.
[uzi@quince-srv2 ~/oneliners]$ cat *_R1.dat > R1.dat 
[uzi@quince-srv2 ~/oneliners]$ cat *_R2.dat > R2.dat 
Step 3: Find the unique tile numbers (column 2) from both R1.dat and R2.dat
[uzi@quince-srv2 ~/oneliners]$ awk -F"," '{print $2}' R1.dat | sort | uniq > R1.uniq
[uzi@quince-srv2 ~/oneliners]$ awk -F"," '{print $2}' R2.dat | sort | uniq > R2.uniq
Step 4: Extract the data for each tile and save it as a space-separated text file (as required by gnuplot) inside the R1 and R2 folders. This will create TXT files 1101.txt, 1102.txt, and so on.
[uzi@quince-srv2 ~/oneliners]$ for i in $(cat R1.uniq); do awk -F"," -v pattern=$i '$2==pattern{gsub(","," ",$0);print $0}' R1.dat > R1/$i.txt; done
[uzi@quince-srv2 ~/oneliners]$ for i in $(cat R2.uniq); do awk -F"," -v pattern=$i '$2==pattern{gsub(","," ",$0);print $0}' R2.dat > R2/$i.txt; done
Step 5: Make a file gnuscript.txt with the following contents and place it in both the R1 and R2 folders. When called, this script generates a scatterplot as a *.txt.ps file
PATH=$0
DATAFILE=$1
set palette
plot PATH."/".DATAFILE using 3:4:5 with dots palette notitle
set xlabel "X"
set ylabel "Y"
set term postscript color
set output DATAFILE.'.ps'
replot
set term x11
Step 6: Call the script on individual folders and use ImageMagick to convert the *.txt.ps into *.txt.jpg, for example, in R1 folder:
[uzi@quince-srv2 ~/oneliners/R1]$ for i in $(ls ????.txt); do echo processing $i now;echo "call \"gnuscript.txt\" \"'.'\" \"'$i'\"" | gnuplot; convert $i.ps $i.jpg; done
Step 7: To get the total reads for each tile, use
[uzi@quince-srv2 ~/oneliners/R1]$ awk -F"," '{print $2}' R1.dat | sort | uniq -c
Note: I have used the above one-liners to assess the read qualities between two runs (Run1 and Run2), where the flow cells of Run1 are less dense (roughly 1/7) than those of Run2, with the total number of tiles being 28. As a result of the higher density, I was getting fewer overlapping reads (assembled from forward and reverse reads) for Run2. In the plots generated in Step 6, reads with lighter shades correspond to higher PHRED quality scores. The following observations were made from the results given below: in general, forward reads are of better quality than reverse reads; noisy reads show some sort of spatial trend, i.e. reads on the edges are noisier; there is missing data around the edges (are those reads filtered out by read QC before processing?); and there are rectangular patches that are more prominent for Run1. In some cases, the average read qualities in these patches differ. Moreover, in the plot for tile 1114 of Run2, one can notice a purple artifact in both the forward and reverse reads. Similarly, for tile 2114, the reverse reads are worse than in the other tiles.

[Per-tile quality scatterplots: 1101.jpg to 1113.jpg and 2101.jpg to 2113.jpg]

Subsampling FASTA and FASTQ files

We can use reservoir sampling to subsample FASTA and FASTQ files. I have the following perl one-liners (though I'd prefer an awk one-liner in this case) to subsample paired-end FASTQ files (the first one seeds the random number generator from the compressed output of the "ps axww" command):
[uzi@quince-srv2 ~/oneliners]$ paste forward.fastq reverse.fastq | perl -ne 'BEGIN{$k=10;srand (time ^ $$ ^ unpack "%L*", `ps axww | gzip`)}chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){$s= ($x++<$k) ? $x-1:int(rand()*$x);if($s<$k){$R{$s}=join("\t",@a)}} END { open(FR,">","forward_sub.fastq");open(RR,">","reverse_sub.fastq");foreach(keys%R){@a=split("\t",$R{$_});print FR join("\n",@a[grep { !($_ & 1) } 0 .. $#a])."\n"; print RR join("\n",@a[grep { ($_ & 1) } 0 .. $#a])."\n"}}'
[uzi@quince-srv2 ~/oneliners]$ paste forward.fastq reverse.fastq | perl -ne 'BEGIN{$k=10;srand (time ^ $$)}chomp($_);push @a, $_; @a = @a[@a-4..$#a]; if ($. % 4 == 0){$s= ($x++<$k) ? $x-1:int(rand()*$x);if($s<$k){$R{$s}=join("\t",@a)}} END { open(FR,">","forward_sub.fastq");open(RR,">","reverse_sub.fastq");foreach(keys%R){@a=split("\t",$R{$_});print FR join("\n",@a[0,2,4,6])."\n"; print RR join("\n",@a[1,3,5,7])."\n"}}'
In the above one-liners, you can change $k to the number you want to downsample your reads to.
Reservoir sampling: It is a simple random sampling strategy to produce a sample without replacement from a stream of data in one pass: O(N). We want to sample s instances - uniformly at random without replacement - from a population of n records, where n is not known in advance. Figuring out n would require 2 passes; reservoir sampling achieves this in 1 pass. A reservoir R here is simply an array of size s. Let D be the data stream of size n; the algorithm is then as follows (a small Python sketch of the same procedure is given after the pseudocode):
Store first s elements into R.
	for each element in position k = s+1 to n ,
		accept it with probability s/k
		if accepted, choose a random element from R to replace. 
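A minimal Python rendering of this algorithm (the perl and awk one-liners in this section implement exactly the same logic after flattening each FASTA/FASTQ record onto a single line):

import random

def reservoir_sample(stream, k):
    # Uniform sample of k items from an iterable of unknown length, in one pass
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)         # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)       # keep item with probability k/(i+1)
            if j < k:
                reservoir[j] = item        # replace a randomly chosen element
    return reservoir

print(reservoir_sample(range(1, 101), 10)) # 10 numbers drawn uniformly from 1..100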
Using reservoir sampling, we have written several bash scripts that allow us to subsample single-end and paired-end FASTA as well as FASTQ files. The contents of these scripts are as follows:

subsample_se_fasta.sh:
Example usage: ./subsample_se_fasta.sh s.fasta s_sub.fasta 100000
#!/bin/bash
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $1 | awk 'NR>1{ printf("%s",$0); n++; if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$3 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$2" '{print $1"\n"$2 > f}'

subsample_se_fastq.sh:
Example usage: ./subsample_se_fastq.sh s.fastq s_sub.fastq 100000
#!/bin/bash
cat $1 | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$3 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$2" '{print $1"\n"$2"\n"$3"\n"$4 > f}'

subsample_pe_fasta.sh:
Example usage: ./subsample_pe_fasta.sh f.fasta r.fasta f_sub.fasta r_sub.fasta 100000
#!/bin/bash
paste <(awk '/^>/ {printf("\n%s\n",$0);next; } {printf("%s",$0);}  END {printf("\n");}' < $1) <(awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < $2) | awk 'NR>1{ printf("%s",$0); n++; if(n%2==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$5 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$3" -v r="$4" '{print $1"\n"$3 > f;print $2"\n"$4 > r}'

subsample_pe_fastq.sh:
Example usage: ./subsample_pe_fastq.sh f.fastq r.fastq f_sub.fastq r_sub.fastq 100000
#!/bin/bash
paste $1 $2 | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=$5 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" -v f="$3" -v r="$4" '{print $1"\n"$3"\n"$5"\n"$7 > f;print $2"\n"$4"\n"$6"\n"$8 > r}'

Identifying low-coverage genomes in metagenomic shotgun sequencing samples (mock communities)

Here I am using an example dataset to identify the list of genomes with low coverage. It is assumed that BWA, Samtools, and Bedtools are installed. Here are the steps:
Step 1: Index the reference database file (my reference database comprises 59 genomes)
[uzi@quince-srv2 ~/oneliners]$ bwa index ref.fasta
Step 2: Use BWA-MEM to align paired-end sequences. Briefly, the algorithm works by seeding alignments with maximal exact matches (MEMs) and then extending seeds with the affine-gap Smith-Waterman algorithm (SW)
[uzi@quince-srv2 ~/oneliners]$ bwa mem ref.fasta forward.fastq reverse.fastq > aln-pe.sam
Step 3: Convert sam file to binary bam file
[uzi@quince-srv2 ~/oneliners]$ samtools view -h -b -S aln-pe.sam > aln-pe.bam
Step 4: Extract only those sequences that were mapped against the reference database. Use -F 4 switch.
[uzi@quince-srv2 ~/oneliners]$ samtools view -b -F 4 aln-pe.bam > aln-pe.mapped.bam
Step 5: Generate a file lengths.genome that contains two entries per row, genome identifier and genome length.
[uzi@quince-srv2 ~/oneliners]$ samtools view -H aln-pe.mapped.bam | perl -ne 'if ($_ =~ m/^\@SQ/) { print $_ }' | perl -ne 'if ($_ =~ m/SN:(.+)\s+LN:(\d+)/) { print $1, "\t", $2, "\n"}' > lengths.genome
Step 6: Sort BAM file. Many of the downstream analysis programs that use BAM files actually require a sorted BAM file. -m specifies the maximum memory to use, and can be changed to fit your system.
[uzi@quince-srv2 ~/oneliners]$ samtools sort -m 1000000000 aln-pe.mapped.bam aln-pe.mapped.sorted
Step 7: We will now use bedtools. It is a very useful suite of programs for working with SAM/BAM, BED, VCF and GFF files, which you will encounter many times doing NGS analysis. The -ibam switch takes the sorted BAM file we generated earlier, -d reports the depth at each genome position with 1-based coordinates, and -g takes the genome lengths file we generated earlier
[uzi@quince-srv2 ~/oneliners]$ bedtools genomecov -ibam aln-pe.mapped.sorted.bam -d -g lengths.genome  > aln-pe.mapped.bam.perbase.cov
Step 8: Look at the first few entries in the file generated above. First column is genome identifier, second column is position on genome, third column is coverage.
[uzi@quince-srv2 ~/oneliners]$ head aln-pe.mapped.bam.perbase.cov

Acidobacterium_capsulatum_ATCC_51196	1	8
Acidobacterium_capsulatum_ATCC_51196	2	8
Acidobacterium_capsulatum_ATCC_51196	3	8
Acidobacterium_capsulatum_ATCC_51196	4	9
Acidobacterium_capsulatum_ATCC_51196	5	9
Acidobacterium_capsulatum_ATCC_51196	6	9
Acidobacterium_capsulatum_ATCC_51196	7	9
Acidobacterium_capsulatum_ATCC_51196	8	9
Acidobacterium_capsulatum_ATCC_51196	9	9
Acidobacterium_capsulatum_ATCC_51196	10	12
Step 9: Now we will count only those positions where we have >0 coverage.
[uzi@quince-srv2 ~/oneliners]$ awk -F"\t" '$3>0{print $1}' aln-pe.mapped.bam.perbase.cov | sort | uniq -c > aln-pe.mapped.bam.perbase.count
Step 10: To see what we have done, use the cat command
[uzi@quince-srv2 ~/oneliners]$ cat aln-pe.mapped.bam.perbase.count
4126255 Acidobacterium_capsulatum_ATCC_51196
2664070 Akkermansia_muciniphila_ATCC_BAA-835
2176909 Archaeoglobus_fulgidus_DSM_4304
6259688 Bacteroides_thetaiotaomicron_VPI-5482
5162898 Bacteroides_vulgatus_ATCC_8482
5334732 Bordetella_bronchiseptica_strain_RB50
6779108 Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence
2968320 Caldisaccharolyticus_DSM_8903
2759841 Chlorobiumlimicola_DSM_245
3133587 Chlorobiumphaeobacteroides_DSM_266
1966851 Chlorobiumphaeovibrioides_DSM_265
2154672 Chlorobiumtepidum_TLS
5248942 Chloroflexus_aurantiacus_J-10-fl
3841892 Clostridium_thermocellum_ATCC_27405
3048801 Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence
2796764 Desulfovibrio_piger_ATCC_29098
1819007 Dictyoglomus_turgidum_DSM_6724
2520796 Enterococcus_faecalis_V583
1458832 Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586
4636091 Gemmatimonas_aurantiaca_T-27_DNA
  77854 gi|220903286|ref|NC_011883.1|
2909411 gi|222528057|ref|NC_012034.1|
4920011 gi|307128764|ref|NC_014500.1|
1739058 gi|55979969|ref|NC_006461.1|
   7919 gi|83591340|ref|NC_007643.1|
6345359 Herpetosiphon_aurantiacus_ATCC_23779
1558912 Hydrogenobaculum_sp._Y04AAS1
1278529 Ignicoccus_hospitalis_KIN4/I
4901664 Leptothrix_cholodnii_SP-6
1634756 Methanocaldococcus_jannaschii_DSM_2661
1770072 Methanococcus_maripaludis_C5
1650606 Methanococcus_maripaludis_strain_S2,_complete_sequence
 459768 Nanoarchaeum_equitans_Kin4-M
2798863 Nitrosomonas_europaea_ATCC_19718
6409848 Nostoc_sp._PCC_7120_DNA
3017983 Pelodictyon_phaeoclathratiforme_BU-1
1929147 Persephonella_marina_EX-H1
2354658 Porphyromonas_gingivalis_ATCC_33277_DNA
2219120 Pyrobaculum_aerophilum_str._IM2
1998126 Pyrobaculum_calidifontis_JCM_11548
1738110 Pyrococcus_horikoshii_OT3_DNA
7145136 Rhodopirellula_baltica_SH_1_complete_genome
4106385 Ruegeria_pomeroyi_DSS-3
5762894 Salinispora_arenicola_CNS-205
5168083 Salinispora_tropica_CNB-440
5229615 Shewanella_baltica_OS185
5145817 Shewanella_baltica_OS223,
3478897 Sulfitobacter_NAS-14.1_scf_1099451320477_
3029105 Sulfitobacter_sp._EE-36_scf_1099451318008_
2666279 Sulfolobus_tokodaii
1520314 Sulfurihydrogenibium_yellowstonense_SS-5
1828687 SulfuriYO3AOP1
2361532 Thermoanaerobacter_pseudethanolicus_ATCC_33223
1884531 Thermotoga_neapolitana_DSM_4359
1823470 Thermotoga_petrophila_RKU-1
1877693 Thermotoga_sp._RQ2
2829658 Treponema_denticola_ATCC_35405
2274638 Treponema_vincentii_ATCC_35580_NZACYH00000000.1
2052189 Zymomonas_mobilis_subsp._mobilis_ZM4
Step 11: We will now use the above file together with lengths.genome to calculate the proportions using the following one-liner. It reads lengths.genome line by line, assigns the genome identifier to myArray[0] and its length to myArray[1]. It then searches for the identifier in aln-pe.mapped.bam.perbase.count, extracts the base count, and uses bc to calculate the proportion.
[uzi@quince-srv2 ~/oneliners]$ while IFS=$'\t' read -r -a myArray; do echo -e "${myArray[0]},$( echo "scale=5;"$(awk -v pattern="${myArray[0]}" '$2==pattern{print $1}' aln-pe.mapped.bam.perbase.count)"/"${myArray[1]} | bc ) "; done < lengths.genome > aln-pe.mapped.bam.genomeproportion
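For readers who prefer a small script to the nested shell commands, Steps 9 to 11 boil down to counting, for each genome, the positions with non-zero coverage and dividing by the genome length; a hypothetical Python sketch of the same calculation (using the file names from above):

from collections import defaultdict

covered = defaultdict(int)

# Column 1: genome identifier, column 2: position, column 3: coverage
with open("aln-pe.mapped.bam.perbase.cov") as fh:
    for line in fh:
        genome, _pos, depth = line.rstrip("\n").split("\t")
        if int(depth) > 0:
            covered[genome] += 1

# lengths.genome: genome identifier and genome length, tab separated
with open("lengths.genome") as fh:
    for line in fh:
        genome, length = line.rstrip("\n").split("\t")
        print("{0},{1:.5f}".format(genome, covered.get(genome, 0) / int(length)))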
Step 12: Some of the reference genomes were downloaded from NCBI. We will use a file IDs_metagenomes2.txt that contains the meaningful mapping from accession numbers to genome names.
[uzi@quince-srv2 ~/oneliners]$ cat IDs_metagenomes2.txt
gi|220903286|ref|NC_011883.1|,Desulfovibrio_desulfuricans
gi|222528057|ref|NC_012034.1|,Caldicellulosiruptor_bescii
gi|307128764|ref|NC_014500.1|,Dickeya_dadantii
gi|55979969|ref|NC_006461.1|,Thermus_thermophilus
gi|83591340|ref|NC_007643.1|,Rhodospirillum_rubrum
Step 13: Download my annotation script convIDs.pl and then use IDs_metagenomes2.txt to annotate column 1.
[uzi@quince-srv2 ~/oneliners]$ perl convIDs.pl -i aln-pe.mapped.bam.genomeproportion -l IDs_metagenomes2.txt -c 1 -t comma > aln-pe.mapped.bam.genomeproportion.annotated
Step 14: We have a total of 59 genomes in the reference database. To see how much of the 59 genomes we recovered overall (the sum of the per-genome covered proportions), we will use the following one-liner:
[uzi@quince-srv2 ~/oneliners]$ awk -F "," '{sum+=$NF} END{print "Total genomes covered:"sum}' aln-pe.mapped.bam.genomeproportion
Total genomes covered:55.8055
Step 15: Now we will identify those genomes for which the covered proportion is 0.99 or less.
[uzi@quince-srv2 ~/oneliners]$ awk -F "," '$NF<=0.99{print $0}' aln-pe.mapped.bam.genomeproportion.annotated 
Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence,.69664 
Desulfovibrio_desulfuricans,.02709 
Desulfovibrio_piger_ATCC_29098,.98358 
Dictyoglomus_turgidum_DSM_6724,.98030 
Enterococcus_faecalis_V583,.78333 
Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586,.67088 
Ignicoccus_hospitalis_KIN4/I,.98534 
Methanocaldococcus_jannaschii_DSM_2661,.98185 
Nanoarchaeum_equitans_Kin4-M,.93661 
Rhodospirillum_rubrum,.00181 
Sulfolobus_tokodaii,.98943 
Thermus_thermophilus,.94016 
Treponema_vincentii_ATCC_35580_NZACYH00000000.1,.90457 
Step 16: The complete list is as follows:
[uzi@quince-srv2 ~/oneliners]$ cat aln-pe.mapped.bam.genomeproportion.annotated
Acidobacterium_capsulatum_ATCC_51196,.99973 
Akkermansia_muciniphila_ATCC_BAA-835,.99998 
Archaeoglobus_fulgidus_DSM_4304,.99931 
Bacteroides_thetaiotaomicron_VPI-5482,.99989 
Bacteroides_vulgatus_ATCC_8482,.99994 
Bordetella_bronchiseptica_strain_RB50,.99916 
Burkholderia1_xenovorans_LB400_chromosome_1,_complete_sequence,.69664 
Caldicellulosiruptor_bescii,.99646 
Caldisaccharolyticus_DSM_8903,.99934 
Chlorobiumlimicola_DSM_245,.99879 
Chlorobiumphaeobacteroides_DSM_266,.99989 
Chlorobiumphaeovibrioides_DSM_265,.99999 
Chlorobiumtepidum_TLS,.99987 
Chloroflexus_aurantiacus_J-10-fl,.99817 
Clostridium_thermocellum_ATCC_27405,.99963 
Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence,.99601 
Desulfovibrio_desulfuricans,.02709 
Desulfovibrio_piger_ATCC_29098,.98358 
Dickeya_dadantii,.99943 
Dictyoglomus_turgidum_DSM_6724,.98030 
Enterococcus_faecalis_V583,.78333 
Fusobacterium_nucleatum_subsp._nucleatum_ATCC_25586,.67088 
Gemmatimonas_aurantiaca_T-27_DNA,.99981 
Herpetosiphon_aurantiacus_ATCC_23779,.99980 
Hydrogenobaculum_sp._Y04AAS1,.99961 
Ignicoccus_hospitalis_KIN4/I,.98534 
Leptothrix_cholodnii_SP-6,.99842 
Methanocaldococcus_jannaschii_DSM_2661,.98185 
Methanococcus_maripaludis_C5,.99399 
Methanococcus_maripaludis_strain_S2,_complete_sequence,.99366 
Nanoarchaeum_equitans_Kin4-M,.93661 
Nitrosomonas_europaea_ATCC_19718,.99529 
Nostoc_sp._PCC_7120_DNA,.99938 
Pelodictyon_phaeoclathratiforme_BU-1,.99991 
Persephonella_marina_EX-H1,.99941 
Porphyromonas_gingivalis_ATCC_33277_DNA,.99990 
Pyrobaculum_aerophilum_str._IM2,.99851 
Pyrobaculum_calidifontis_JCM_11548,.99443 
Pyrococcus_horikoshii_OT3_DNA,.99977 
Rhodopirellula_baltica_SH_1_complete_genome,.99993 
Rhodospirillum_rubrum,.00181 
Ruegeria_pomeroyi_DSS-3,.99925 
Salinispora_arenicola_CNS-205,.99594 
Salinispora_tropica_CNB-440,.99705 
Shewanella_baltica_OS185,.99998 
Shewanella_baltica_OS223,,.99998 
Sulfitobacter_NAS-14.1_scf_1099451320477_,.99697 
Sulfitobacter_sp._EE-36_scf_1099451318008_,.99921 
Sulfolobus_tokodaii,.98943 
Sulfurihydrogenibium_yellowstonense_SS-5,.99087 
SulfuriYO3AOP1,.99469 
Thermoanaerobacter_pseudethanolicus_ATCC_33223,.99945 
Thermotoga_neapolitana_DSM_4359,.99998 
Thermotoga_petrophila_RKU-1,.99997 
Thermotoga_sp._RQ2,1.00000 
Thermus_thermophilus,.94016 
Treponema_denticola_ATCC_35405,.99523 
Treponema_vincentii_ATCC_35580_NZACYH00000000.1,.90457 
 

Variant Calling

It is assumed that BWA, Samtools, Picard Tools, Interactive Genome Viewer, GATK, and VCFtools are installed and available in the system $PATH variable. I have a sample from Clostridium difficile 630 from which I will call and summarize variants (SNPs).
Index the reference database downloaded from the above link:
[uzi@quince-srv2 ~/oneliners/SNPs]$ bwa index ref.fa
Generate a SAM file by aligning paired-end reads against the reference database
[uzi@quince-srv2 ~/oneliners/SNPs]$ bwa mem ref.fa forward.fastq reverse.fastq > aln-pe.sam
Convert SAM to BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools view -h -b -S aln-pe.sam > aln-pe.bam
Sort BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools sort -m 1000000000 aln-pe.bam aln-pe.sorted
Now we will check alignment statistics using Picard tools
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which CollectAlignmentSummaryMetrics.jar) INPUT=aln-pe.sorted.bam OUTPUT=aln-pe.sorted.alignment_stats.txt REFERENCE_SEQUENCE=ref.fa
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vi -e "^#" -e "^$" aln-pe.sorted.alignment_stats.txt
CATEGORY	TOTAL_READS	PF_READS	PCT_PF_READS	PF_NOISE_READS	PF_READS_ALIGNED	PCT_PF_READS_ALIGNED	PF_ALIGNED_BASES	PF_HQ_ALIGNED_READS	PF_HQ_ALIGNED_BASES	PF_HQ_ALIGNED_Q20_BASES	PF_HQ_MEDIAN_MISMATCHES	PF_MISMATCH_RATE	PF_HQ_ERROR_RATE	PF_INDEL_RATE	MEAN_READ_LENGTH	READS_ALIGNED_IN_PAIRS	PCT_READS_ALIGNED_IN_PAIRS	BAD_CYCLES	STRAND_BALANCE	PCT_CHIMERAS	PCT_ADAPTER	SAMPLE	LIBRARY	READ_GROUP
FIRST_OF_PAIR	333371	333371	1	0	313569	0.940601	90716605	301550	88367790	87254618	2	0.0105	0.010296	0.000251	298.542723	311278	0.992694	0.501912	0.012613	0.000183			
SECOND_OF_PAIR	333265	333265	1	0	312663	0.938181	89921348	300915	87620763	84183606	2	0.013972	0.013701	0.000256	298.30708	311185	0.995273	0	0.500162	0.012173	0.000159			
PAIR	666636	666636	1	0	626232	0.939391	180637953	602465	175988553	171438224	2	0.012228	0.011991	0.000253	298.42492	622463	0.993981	0.501038	0.012393	0.000171
The detailed description of these summary metrics is given here. From this link, the following metrics are of interest to us:
PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence
PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads
PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases
[uzi@quince-srv2 ~/oneliners/SNPs]$  grep -vi -e "^#" -e "^$" aln-pe.sorted.alignment_stats.txt | perl -alne 'print join("\t",@F[0,12,13,14])'
CATEGORY	PF_MISMATCH_RATE	PF_HQ_ERROR_RATE	PF_INDEL_RATE
FIRST_OF_PAIR	0.0105			0.010296		0.000251
SECOND_OF_PAIR	0.013972		0.013701		0.000256
PAIR		0.012228		0.011991		0.000253
As can be seen, the error rates are quite low and we can proceed with the analysis.
Extract only those sequences that were mapped against the reference database. Use -F 4 switch:
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools view -b -F 4 aln-pe.bam > aln-pe.mapped.bam
Sort mapped reads BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools sort -m 1000000000 aln-pe.mapped.bam aln-pe.mapped.sorted
Now, we will assess the quality of mapped reads by generating charts for GC Bias, Mean Quality by Cycle, and Quality Score Distribution
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which CollectGcBiasMetrics.jar) R=ref.fa I=aln-pe.mapped.sorted.bam O=aln-pe.mapped.sorted_GCBias.txt CHART=aln-pe.mapped.sorted_GCBias.pdf ASSUME_SORTED=true
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which MeanQualityByCycle.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.mqc.txt CHART_OUTPUT=aln-pe.mapped.sorted.mqc.pdf
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which QualityScoreDistribution.jar) INPUT=aln-pe.mapped.sorted.bam OUTPUT=aln-pe.mapped.sorted.qsd.txt CHART_OUTPUT=aln-pe.mapped.sorted.qsd.pdf
  
We will use two approaches to call SNPs

Calling SNPs with samtools mpileup

To generate a BCF file (a binary format used to hold information about sequence variants such as SNPs), run the following:
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools mpileup -uD -f ref.fa aln-pe.mapped.sorted.bam | bcftools view -bvcg - > aln-pe.mapped.sorted.raw.bcf
where -u tells samtools to output an uncompressed BCF, -D tells it to keep read depth for each sample, and -f specifies the reference genome file. Samtools calculates the genotype likelihoods, and bcftools does the SNP calling based on those likelihoods. The -b flag in bcftools tells it to output BCF format instead of VCF, -c tells it to do SNP calling, -v tells it to only output potential variant sites, and -g tells it to call genotypes for each sample in addition to just calling SNPs.
Now run
[uzi@quince-srv2 ~/oneliners/SNPs]$ bcftools view aln-pe.mapped.sorted.raw.bcf | vcfutils.pl varFilter -D100 > aln-pe.mapped.sorted.vcf
This line converts the binary BCF file into a text VCF file. We then pipe that into vcfutils.pl with the varFilter -D100 option, which filters out SNPs with a read depth higher than 100 (sites with higher coverage than that tend to represent variation between variable copy number repeats). Depending on the coverage you have in your dataset you can change this number, for example to -D500. Samtools detects SNPs by checking the number of different reads that share a mismatch from the reference, the sequence quality data, and the expected sequencing error rates at a given site. Now we display the first few records of the VCF file. You can get more information on the format from here.
[uzi@quince-srv2 ~/oneliners/SNPs]$ head -50 aln-pe.mapped.sorted.vcf

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	aln-pe.mapped.sorted.bam
Cdiff630	1047	.	C	T	216	.	DP=60;VDB=0.0135;AF1=1;AC1=2;DP4=0,0,33,25;MQ=60;FQ=-202	GT:PL:DP:GQ	1/1:249,175,0:58:99
Cdiff630	1960	.	C	T	177	.	DP=64;VDB=0.0137;AF1=1;AC1=2;DP4=0,0,27,37;MQ=60;FQ=-220	GT:PL:DP:GQ	1/1:210,193,0:64:99
Cdiff630	2107	.	G	A	216	.	DP=58;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,29,28;MQ=60;FQ=-199	GT:PL:DP:GQ	1/1:249,172,0:57:99
Cdiff630	2206	.	A	G	205	.	DP=55;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,23,30;MQ=60;FQ=-187	GT:PL:DP:GQ	1/1:238,160,0:53:99
Cdiff630	2212	.	T	C	222	.	DP=56;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,24,31;MQ=60;FQ=-193	GT:PL:DP:GQ	1/1:255,166,0:55:99
Cdiff630	2332	.	T	C	222	.	DP=50;VDB=0.0131;AF1=1;AC1=2;DP4=0,0,23,25;MQ=60;FQ=-171	GT:PL:DP:GQ	1/1:255,144,0:48:99
Cdiff630	2443	.	T	C	222	.	DP=59;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,32,26;MQ=60;FQ=-202	GT:PL:DP:GQ	1/1:255,175,0:58:99
Cdiff630	2979	.	G	A	159	.	DP=37;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,20,15;MQ=60;FQ=-132	GT:PL:DP:GQ	1/1:192,105,0:35:99
Cdiff630	3235	.	T	C	222	.	DP=37;VDB=0.0132;AF1=1;AC1=2;DP4=0,0,21,14;MQ=60;FQ=-132	GT:PL:DP:GQ	1/1:255,105,0:35:99
Cdiff630	3304	.	A	G	55.1	.	DP=38;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,5,2;MQ=60;FQ=-48	GT:PL:DP:GQ	1/1:88,21,0:7:39
Cdiff630	3385	.	G	A	169	.	DP=46;VDB=0.0121;AF1=1;AC1=2;DP4=0,0,22,23;MQ=60;FQ=-162	GT:PL:DP:GQ	1/1:202,135,0:45:99
Cdiff630	5602	.	A	G	149	.	DP=45;VDB=0.0105;AF1=1;AC1=2;DP4=0,1,19,25;MQ=60;FQ=-144;PV4=1,1,1,1	GT:PL:DP:GQ	1/1:182,117,0:45:99
Cdiff630	6034	.	C	T,G	203	.	DP=70;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,37,31;MQ=60;FQ=-226	GT:PL:DP:GQ	1/1:236,199,0,239,168,233:68:99
Cdiff630	6310	.	C	T	217	.	DP=62;VDB=0.0113;AF1=1;AC1=2;DP4=0,0,27,32;MQ=60;FQ=-205	GT:PL:DP:GQ	1/1:250,178,0:59:99
Cdiff630	6812	.	G	A	222	.	DP=60;VDB=0.0135;AF1=1;AC1=2;DP4=0,0,32,26;MQ=60;FQ=-202	GT:PL:DP:GQ	1/1:255,175,0:58:99
Cdiff630	7281	.	C	A	222	.	DP=43;VDB=0.0126;AF1=1;AC1=2;DP4=0,0,22,21;MQ=60;FQ=-156	GT:PL:DP:GQ	1/1:255,129,0:43:99
Cdiff630	7427	.	G	A	195	.	DP=41;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,18,19;MQ=60;FQ=-138	GT:PL:DP:GQ	1/1:228,111,0:37:99
Cdiff630	7467	.	G	A	162	.	DP=39;VDB=0.0134;AF1=1;AC1=2;DP4=0,0,19,19;MQ=60;FQ=-141	GT:PL:DP:GQ	1/1:195,114,0:38:99
Cdiff630	7976	.	A	T	222	.	DP=50;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,26,24;MQ=60;FQ=-178	GT:PL:DP:GQ	1/1:255,151,0:50:99
Cdiff630	8171	.	T	C	222	.	DP=42;VDB=0.0135;AF1=1;AC1=2;DP4=1,0,18,21;MQ=60;FQ=-126;PV4=0.48,1,1,1	GT:PL:DP:GQ	1/1:255,99,0:40:99
Cdiff630	8698	.	T	C	215	.	DP=49;VDB=0.0129;AF1=1;AC1=2;DP4=0,0,27,21;MQ=60;FQ=-171	GT:PL:DP:GQ	1/1:248,144,0:48:99
Cdiff630	8770	.	T	C	222	.	DP=48;VDB=0.0133;AF1=1;AC1=2;DP4=0,0,26,21;MQ=60;FQ=-168	GT:PL:DP:GQ	1/1:255,141,0:47:99
Cdiff630	9394	.	G	A	153	.	DP=54;VDB=0.0136;AF1=1;AC1=2;DP4=0,0,21,31;MQ=60;FQ=-184	GT:PL:DP:GQ	1/1:186,157,0:52:99
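As an aside, the INFO column in each record is a semicolon-separated list of key=value pairs; DP4, for example, holds the high-quality ref-forward, ref-reverse, alt-forward and alt-reverse base counts (see the header above), from which the fraction of reads supporting the alternate allele can be computed. A small illustrative Python parser, applied to the first record shown above:

def parse_info(info_field):
    # Parse a VCF INFO column into a dict (flags such as INDEL get the value True)
    info = {}
    for entry in info_field.split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True
    return info

record = ("Cdiff630\t1047\t.\tC\tT\t216\t.\t"
          "DP=60;VDB=0.0135;AF1=1;AC1=2;DP4=0,0,33,25;MQ=60;FQ=-202\t"
          "GT:PL:DP:GQ\t1/1:249,175,0:58:99")
info = parse_info(record.split("\t")[7])
ref_f, ref_r, alt_f, alt_r = map(int, info["DP4"].split(","))
print(info["DP"], (alt_f + alt_r) / (ref_f + ref_r + alt_f + alt_r))   # 60 1.0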

Calling SNPs with GATK's Unified Genotyper

The GATK developers recommend local re-alignment around indels as a pre-processing step, because reads whose ends map to the location of an indel can sometimes lead to false positive SNP calls. In GATK, we can do local re-alignment around potential indel sites, in which we incorporate information about all the reads in that region simultaneously, rather than mapping each read individually. To use GATK, we have to generate a dictionary file from the reference, and fix our BAM file by adding read group information, otherwise GATK will throw a "SAM/BAM file is malformed" error. We also need to index the resulting file. These steps are as follows:
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which CreateSequenceDictionary.jar) R=ref.fa O=ref.dict
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which AddOrReplaceReadGroups.jar) I=aln-pe.mapped.sorted.bam O=aln-pe.mapped.sorted.grps.bam LB=whatever PL=illumina PU=whatever SM=whatever
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools index aln-pe.mapped.sorted.grps.bam
First, GATK needs to figure out which regions are in need of re-alignment:
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx2g -jar $(which GenomeAnalysisTK.jar) -T RealignerTargetCreator -R ref.fa -I aln-pe.mapped.sorted.grps.bam -o aln-pe.mapped.sorted.grps.realign.intervals
And then it needs to actually do the re-alignment
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -Xmx4g -jar $(which GenomeAnalysisTK.jar) -I aln-pe.mapped.sorted.grps.bam -R ref.fa -T IndelRealigner -targetIntervals aln-pe.mapped.sorted.grps.realign.intervals -o aln-pe.mapped.sorted.grps.realigned.bam
We again fix the resulting BAM file
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which AddOrReplaceReadGroups.jar) I=aln-pe.mapped.sorted.grps.realigned.bam O=aln-pe.mapped.sorted.grps.realigned.grps.bam LB=whatever PL=illumina PU=whatever SM=whatever
[uzi@quince-srv2 ~/oneliners/SNPs]$ samtools index aln-pe.mapped.sorted.grps.realigned.grps.bam 
We now do the SNP calling on the resultant file
[uzi@quince-srv2 ~/oneliners/SNPs]$ java -jar $(which GenomeAnalysisTK.jar) -R ref.fa -T UnifiedGenotyper -I aln-pe.mapped.sorted.grps.realigned.grps.bam -o aln-pe.mapped.sorted.grps.realigned.grps.vcf -stand_call_conf 50.0 -stand_emit_conf 10.0 -dcov 500 -rf BadCigar
The parameters are slightly different from those used in samtools mpileup. -stand_emit_conf 10.0 means don't report any potential SNPs with a quality below 10.0; but unless they meet the quality threshold set by -stand_call_conf (50.0, in this case), they will be listed as failing the quality filter. -dcov 500 means that at any site with more than 500x coverage, the genotype caller will only use 500 randomly selected reads, for computational efficiency.
Now, we check first few records of the generated VCF file.
[uzi@quince-srv2 ~/oneliners/SNPs]$ head -50 aln-pe.mapped.sorted.grps.realigned.grps.vcf 
##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##GATKCommandLine=<ID=UnifiedGenotyper,Version=3.1-1-g07a4bf8,Date="Mon Apr 07 01:40:13 BST 2014",Epoch=1396831213225,CommandLineOptions="analysis_type=UnifiedGenotyper input_file=[aln-pe.mapped.sorted.grps.realigned.grps.bam] showFullBamList=false read_buffer_size=null phone_home=AWS gatk_key=null tag=NA read_filter=[BadCigar] intervals=null excludeIntervals=null interval_set_rule=UNION interval_merging=ALL interval_padding=0 reference_sequence=ref.fa nonDeterministicRandomSeed=false disableDithering=false maxRuntime=-1 maxRuntimeUnits=MINUTES downsampling_type=BY_SAMPLE downsample_to_fraction=null downsample_to_coverage=500 baq=OFF baqGapOpenPenalty=40.0 fix_misencoded_quality_scores=false allow_potentially_misencoded_quality_scores=false useOriginalQualities=false defaultBaseQualities=-1 performanceLog=null BQSR=null quantize_quals=0 disable_indel_quals=false emit_original_quals=false preserve_qscores_less_than=6 globalQScorePrior=-1.0 validation_strictness=SILENT remove_program_records=false keep_program_records=false sample_rename_mapping_file=null unsafe=null disable_auto_index_creation_and_locking_when_reading_rods=false num_threads=1 num_cpu_threads_per_data_thread=1 num_io_threads=0 monitorThreadEfficiency=false num_bam_file_handles=null read_group_black_list=null pedigree=[] pedigreeString=[] pedigreeValidationType=STRICT allow_intervals_with_unindexed_bam=false generateShadowBCF=false variant_index_type=DYNAMIC_SEEK variant_index_parameter=-1 logging_level=INFO log_to_file=null help=false version=false genotype_likelihoods_model=SNP pcr_error_rate=1.0E-4 computeSLOD=false annotateNDA=false pair_hmm_implementation=LOGLESS_CACHING min_base_quality_score=17 max_deletion_fraction=0.05 allSitePLs=false min_indel_count_for_genotyping=5 min_indel_fraction_per_sample=0.25 indelGapContinuationPenalty=10 indelGapOpenPenalty=45 indelHaplotypeSize=80 indelDebug=false ignoreSNPAlleles=false allReadsSP=false ignoreLaneInfo=false reference_sample_calls=(RodBinding name= source=UNBOUND) reference_sample_name=null sample_ploidy=2 min_quality_score=1 max_quality_score=40 site_quality_prior=20 min_power_threshold_for_calling=0.95 min_reference_depth=100 exclude_filtered_reference_sites=false output_mode=EMIT_VARIANTS_ONLY heterozygosity=0.001 indel_heterozygosity=1.25E-4 genotyping_mode=DISCOVERY standard_min_confidence_threshold_for_calling=50.0 standard_min_confidence_threshold_for_emitting=10.0 alleles=(RodBinding name= source=UNBOUND) max_alternate_alleles=6 input_prior=[] contamination_fraction_to_filter=0.0 contamination_fraction_per_sample_file=null p_nonref_model=EXACT_INDEPENDENT exactcallslog=null dbsnp=(RodBinding name= source=UNBOUND) comp=[] out=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub no_cmdline_in_header=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub sites_only=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub bcf=org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub onlyEmitSamples=[] debug_file=null metrics_file=null annotation=[] excludeAnnotation=[] filter_reads_with_N_cigar=false filter_mismatching_base_and_quals=false filter_bases_not_stored=false">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=A,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=A,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RPA,Number=.,Type=Integer,Description="Number of times tandem repeat unit is repeated, for each allele (including reference)">
##INFO=<ID=RU,Number=1,Type=String,Description="Tandem repeat unit (bases)">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=STR,Number=0,Type=Flag,Description="Variant is a short tandem repeat">
##contig=<ID=Cdiff630,length=4290252>
##reference=file:///home/uzi/oneliners/SNPs/ref.fa
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	whatever
Cdiff630	1047	.	C	T	1896.77	.	AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.61	GT:AD:DP:GQ:PL	1/1:0,60:60:99:1925,141,0
Cdiff630	1960	.	C	T	2161.77	.	AC=2;AF=1.00;AN=2;DP=64;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78	GT:AD:DP:GQ:PL	1/1:0,64:64:99:2190,163,0
Cdiff630	2107	.	G	A	1907.77	.	AC=2;AF=1.00;AN=2;DP=58;Dels=0.00;FS=0.000;HaplotypeScore=3.5489;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89	GT:AD:DP:GQ:PL	1/1:0,58:58:99:1936,144,0
Cdiff630	2206	.	A	G	1826.77	.	AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.21	GT:AD:DP:GQ:PL	1/1:0,55:55:99:1855,138,0
Cdiff630	2212	.	T	C	1841.77	.	AC=2;AF=1.00;AN=2;DP=56;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89	GT:AD:DP:GQ:PL	1/1:0,56:56:99:1870,138,0
Cdiff630	2332	.	T	C	1632.77	.	AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.66	GT:AD:DP:GQ:PL	1/1:0,50:50:99:1661,123,0
Cdiff630	2443	.	T	C	2011.77	.	AC=2;AF=1.00;AN=2;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=3.4709;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.10	GT:AD:DP:GQ:PL	1/1:0,59:59:99:2040,150,0
Cdiff630	2979	.	G	A	1155.77	.	AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.24	GT:AD:DP:GQ:PL	1/1:0,37:37:87:1184,87,0
Cdiff630	3235	.	T	C	1249.77	.	AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78	GT:AD:DP:GQ:PL	1/1:0,37:37:96:1278,96,0
Cdiff630	3304	.	A	G	1269.77	.	AC=2;AF=1.00;AN=2;DP=38;Dels=0.00;FS=0.000;HaplotypeScore=0.9469;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.41	GT:AD:DP:GQ:PL	1/1:0,38:38:96:1298,96,0
Cdiff630	3385	.	G	A	1400.77	.	AC=2;AF=1.00;AN=2;DP=46;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.45	GT:AD:DP:GQ:PL	1/1:0,46:46:99:1429,105,0
Cdiff630	5602	.	A	G	1362.77	.	AC=2;AF=1.00;AN=2;BaseQRankSum=1.270;DP=45;Dels=0.00;FS=0.000;HaplotypeScore=0.9992;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.115;QD=30.28;ReadPosRankSum=1.578	GT:AD:DP:GQ:PL	1/1:1,44:45:99:1391,104,0
Cdiff630	6034	.	C	T	1983.77	.	AC=2;AF=1.00;AN=2;DP=70;Dels=0.00;FS=0.000;HaplotypeScore=1.8317;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.34	GT:AD:DP:GQ:PL	1/1:0,68:70:99:2012,150,0
Cdiff630	6310	.	C	T	2068.77	.	AC=2;AF=1.00;AN=2;DP=62;Dels=0.00;FS=0.000;HaplotypeScore=1.7882;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.37	GT:AD:DP:GQ:PL	1/1:0,62:62:99:2097,159,0
Cdiff630	6812	.	G	A	1894.77	.	AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.58	GT:AD:DP:GQ:PL	1/1:0,60:60:99:1923,141,0
Cdiff630	7281	.	C	A	1343.77	.	AC=2;AF=1.00;AN=2;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.25	GT:AD:DP:GQ:PL	1/1:0,43:43:99:1372,99,0
Cdiff630	7427	.	G	A	1173.77	.	AC=2;AF=1.00;AN=2;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.63	GT:AD:DP:GQ:PL	1/1:0,41:41:87:1202,87,0
Cdiff630	7467	.	G	A	1138.77	.	AC=2;AF=1.00;AN=2;DP=39;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.20	GT:AD:DP:GQ:PL	1/1:0,39:39:84:1167,84,0
Cdiff630	7976	.	A	T	1629.77	.	AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.60	GT:AD:DP:GQ:PL	1/1:0,50:50:99:1658,123,0
We can check how many SNPs are returned by both approaches:
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" -e "INDEL" aln-pe.mapped.sorted.grps.realigned.grps.vcf
36365
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" -e "INDEL" aln-pe.mapped.sorted.vcf
31413
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vc -e "^#" aln-pe.mapped.sorted.vcf

32093
We can use VCFtools, which is very useful for manipulating and comparing VCF files. For example, the above step could be done using VCFtools as
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.vcf

VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf aln-pe.mapped.sorted.vcf

After filtering, kept 1 out of 1 Individuals
After filtering, kept 32093 out of a possible 32093 Sites
Run Time = 0.00 seconds
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf

VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf

After filtering, kept 1 out of 1 Individuals
After filtering, kept 36365 out of a possible 36365 Sites
Run Time = 0.00 seconds
Our genome is 4290252 bp long. We can even see how many SNPs there are in the first 10000 bp
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --chr Cdiff630 --from-bp 1 --to-bp 10000

VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
	--chr Cdiff630
	--to-bp 10000
	--from-bp 1

After filtering, kept 1 out of 1 Individuals
After filtering, kept 23 out of a possible 36365 Sites
Run Time = 0.00 seconds
VCFtools can perform analyses on the variants that pass through the filters, or simply write those variants out to a new file. We can create subsets of VCF files or just remove unwanted variants from VCF files. To write out the variants that pass through the filters, use the --recode option. In addition, use --recode-INFO-all to include all data from the INFO fields in the output. This will generate a file out.recode.vcf in the current folder
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --chr Cdiff630 --from-bp 1 --to-bp 10000 --recode --recode-INFO-all
[uzi@quince-srv2 ~/oneliners/SNPs]$ grep -vi "^#" out.recode.vcf

Cdiff630	1047	.	C	T	1896.77	.	AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.61	GT:AD:DP:GQ:PL	1/1:0,60:60:99:1925,141,0
Cdiff630	1960	.	C	T	2161.77	.	AC=2;AF=1.00;AN=2;DP=64;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78	GT:AD:DP:GQ:PL	1/1:0,64:64:99:2190,163,0
Cdiff630	2107	.	G	A	1907.77	.	AC=2;AF=1.00;AN=2;DP=58;Dels=0.00;FS=0.000;HaplotypeScore=3.5489;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89	GT:AD:DP:GQ:PL	1/1:0,58:58:99:1936,144,0
Cdiff630	2206	.	A	G	1826.77	.	AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.21	GT:AD:DP:GQ:PL	1/1:0,55:55:99:1855,138,0
Cdiff630	2212	.	T	C	1841.77	.	AC=2;AF=1.00;AN=2;DP=56;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.89	GT:AD:DP:GQ:PL	1/1:0,56:56:99:1870,138,0
Cdiff630	2332	.	T	C	1632.77	.	AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.66	GT:AD:DP:GQ:PL	1/1:0,50:50:99:1661,123,0
Cdiff630	2443	.	T	C	2011.77	.	AC=2;AF=1.00;AN=2;DP=59;Dels=0.00;FS=0.000;HaplotypeScore=3.4709;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.10	GT:AD:DP:GQ:PL	1/1:0,59:59:99:2040,150,0
Cdiff630	2979	.	G	A	1155.77	.	AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.24	GT:AD:DP:GQ:PL	1/1:0,37:37:87:1184,87,0
Cdiff630	3235	.	T	C	1249.77	.	AC=2;AF=1.00;AN=2;DP=37;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.78	GT:AD:DP:GQ:PL	1/1:0,37:37:96:1278,96,0
Cdiff630	3304	.	A	G	1269.77	.	AC=2;AF=1.00;AN=2;DP=38;Dels=0.00;FS=0.000;HaplotypeScore=0.9469;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.41	GT:AD:DP:GQ:PL	1/1:0,38:38:96:1298,96,0
Cdiff630	3385	.	G	A	1400.77	.	AC=2;AF=1.00;AN=2;DP=46;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.45	GT:AD:DP:GQ:PL	1/1:0,46:46:99:1429,105,0
Cdiff630	5602	.	A	G	1362.77	.	AC=2;AF=1.00;AN=2;BaseQRankSum=1.270;DP=45;Dels=0.00;FS=0.000;HaplotypeScore=0.9992;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=0.115;QD=30.28;ReadPosRankSum=1.578	GT:AD:DP:GQ:PL	1/1:1,44:45:99:1391,104,0
Cdiff630	6034	.	C	T	1983.77	.	AC=2;AF=1.00;AN=2;DP=70;Dels=0.00;FS=0.000;HaplotypeScore=1.8317;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.34	GT:AD:DP:GQ:PL	1/1:0,68:70:99:2012,150,0
Cdiff630	6310	.	C	T	2068.77	.	AC=2;AF=1.00;AN=2;DP=62;Dels=0.00;FS=0.000;HaplotypeScore=1.7882;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.37	GT:AD:DP:GQ:PL	1/1:0,62:62:99:2097,159,0
Cdiff630	6812	.	G	A	1894.77	.	AC=2;AF=1.00;AN=2;DP=60;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.58	GT:AD:DP:GQ:PL	1/1:0,60:60:99:1923,141,0
Cdiff630	7281	.	C	A	1343.77	.	AC=2;AF=1.00;AN=2;DP=43;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.25	GT:AD:DP:GQ:PL	1/1:0,43:43:99:1372,99,0
Cdiff630	7427	.	G	A	1173.77	.	AC=2;AF=1.00;AN=2;DP=41;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.63	GT:AD:DP:GQ:PL	1/1:0,41:41:87:1202,87,0
Cdiff630	7467	.	G	A	1138.77	.	AC=2;AF=1.00;AN=2;DP=39;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.20	GT:AD:DP:GQ:PL	1/1:0,39:39:84:1167,84,0
Cdiff630	7976	.	A	T	1629.77	.	AC=2;AF=1.00;AN=2;DP=50;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.60	GT:AD:DP:GQ:PL	1/1:0,50:50:99:1658,123,0
Cdiff630	8171	.	T	C	1353.77	.	AC=2;AF=1.00;AN=2;BaseQRankSum=1.568;DP=42;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-1.155;QD=32.23;ReadPosRankSum=0.908	GT:AD:DP:GQ:PL	1/1:1,41:42:83:1382,83,0
Cdiff630	8698	.	T	C	1662.77	.	AC=2;AF=1.00;AN=2;DP=49;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=33.93	GT:AD:DP:GQ:PL	1/1:0,49:49:99:1691,126,0
Cdiff630	8770	.	T	C	1672.77	.	AC=2;AF=1.00;AN=2;DP=48;Dels=0.00;FS=0.000;HaplotypeScore=0.8941;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=34.85	GT:AD:DP:GQ:PL	1/1:0,48:48:99:1701,126,0
Cdiff630	9394	.	G	A	1741.77	.	AC=2;AF=1.00;AN=2;DP=55;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=31.67	GT:AD:DP:GQ:PL	1/1:0,54:54:99:1770,129,0
We can also determine which sites and individuals are shared between the VCF files generated from the two approaches:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.vcf --diff aln-pe.mapped.sorted.grps.realigned.grps.vcf --out compare
This will generate two files, compare.diff.indv_in_files and compare.diff.sites_in_files. Now check compare.diff.sites_in_files (in the IN_FILE column, 1 or 2 means the site was found only in the first or second file, and B means it was found in both):
[uzi@quince-srv2 ~/oneliners/SNPs]$ head compare.diff.sites_in_files

CHROM		POS	IN_FILE	REF	ALT1	ALT2
Cdiff630	1047	B	C	T	T
Cdiff630	1960	B	C	T	T
Cdiff630	2107	B	G	A	A
Cdiff630	2206	B	A	G	G
Cdiff630	2212	B	T	C	C
Cdiff630	2332	B	T	C	C
Cdiff630	2443	B	T	C	C
Cdiff630	2979	B	G	A	A
Cdiff630	3235	B	T	C	C
We can also summarize the sequencing depth:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --depth -c > depth_summary.txt
[uzi@quince-srv2 ~/oneliners/SNPs]$ cat depth_summary.txt

INDV		N_SITES	MEAN_DEPTH
whatever	36365	44.2186
To write out site-wise sequence depths only at sites that have no missing data, include the --max-missing 1.0 argument. This will produce site_depth_summary.ldepth in the current folder:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --site-depth --max-missing 1.0 --out site_depth_summary

VCFtools - v0.1.12a
(C) Adam Auton and Anthony Marcketta 2009

Parameters as interpreted:
	--vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf
	--max-missing 1
	--out site_depth_summary
	--site-depth

After filtering, kept 1 out of 1 Individuals
Outputting Depth for Each Site
After filtering, kept 36365 out of a possible 36365 Sites
Run Time = 0.00 seconds

[uzi@quince-srv2 ~/oneliners/SNPs]$ head site_depth_summary.ldepth 

CHROM		POS	SUM_DEPTH	SUMSQ_DEPTH
Cdiff630	1047	60		3600
Cdiff630	1960	64		4096
Cdiff630	2107	58		3364
Cdiff630	2206	55		3025
Cdiff630	2212	56		3136
Cdiff630	2332	50		2500
Cdiff630	2443	59		3481
Cdiff630	2979	37		1369
Cdiff630	3235	37		1369
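Since there is only one individual here, SUM_DEPTH is simply the per-site depth; the mean depth across sites can be recovered from the .ldepth file with a quick awk sketch (assuming the layout shown above):
awk 'NR>1 && NF{sum+=$3; n++} END{printf "mean site depth = %.4f over %d sites\n", sum/n, n}' site_depth_summary.ldepth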
VCFtools can also convert VCF files into other formats, for example the PED and MAP files used by PLINK:
[uzi@quince-srv2 ~/oneliners/SNPs]$ vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --plink --chr 1 --out output_in_plink
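Note that --chr filters on the CHROM name in the VCF; for the C. difficile example above the chromosome is called Cdiff630, so a variant along the following lines may be needed instead (whether downstream tools accept a non-numeric chromosome code is a separate matter):
vcftools --vcf aln-pe.mapped.sorted.grps.realigned.grps.vcf --plink --chr Cdiff630 --out output_in_plink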
Finally, to view everything together, we can use the Integrative Genomics Viewer (IGV), which will require four files: ref.fa, aln-pe.mapped.sorted.grps.realigned.grps.vcf, aln-pe.mapped.sorted.grps.realigned.grps.bam, and aln-pe.mapped.sorted.grps.realigned.grps.bam.bai.

In the figure above, the red rectangles in the top panel are the variants called by our analysis, and the bottom panel shows the reads aligned to that genome region along with coverage information.
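If the index files are not already present, they can be generated with samtools (a minimal sketch, assuming samtools is on the path; samtools faidx indexes the reference and samtools index creates the .bai for the BAM):
samtools faidx ref.fa
samtools index aln-pe.mapped.sorted.grps.realigned.grps.bam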

Getting linkage information (reads that map to multiple contigs/genomes) from SAM files

I have a SAM file that I generated by aligning paired-end reads against the reference database:
[uzi@quince-srv2 ~/oneliners]$ bwa mem ref.fasta forward.fastq reverse.fastq > aln-pe.sam
The resulting SAM file has 4806207 records (roughly 4.8 million) and it took 2 minutes and 7.15 seconds to get the linkage information (i.e. reads that map to multiple contigs/genomes) using the following one-liner:
[uzi@quince-srv2 ~/oneliners]$ awk '$2~/^99$|^147$|^83$|^163$|^67$|^131$|^115$|^179$|^81$|^161$|^97$|^145$|^65$|^129$|^113$|^177$/ && $2!~/^SN/{print $1"\t"$3"\t"$7}' aln-pe.sam | sort -n -k1,1 | uniq | awk 'BEGIN { FS="\t" } { c[$1]++; l[$1,c[$1]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }'
HWI-ST1242:77:C13H3ACXX:7:2206:11068:81469	Acidobacterium_capsulatum_ATCC_51196		Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2206:11068:81469	Bordetella_bronchiseptica_strain_RB50		Acidobacterium_capsulatum_ATCC_51196
HWI-ST1242:77:C13H3ACXX:7:1306:17569:74649	Thermotoga_petrophila_RKU-1			Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:1306:17569:74649	Thermotoga_sp._RQ2				Thermotoga_petrophila_RKU-1
HWI-ST1242:77:C13H3ACXX:7:2101:13560:96472	Thermotoga_petrophila_RKU-1			Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:2101:13560:96472	Thermotoga_sp._RQ2				Thermotoga_petrophila_RKU-1
HWI-ST1242:77:C13H3ACXX:7:1209:7458:61056	Treponema_denticola_ATCC_35405		Treponema_vincentii_ATCC_35580_NZACYH00000000.1
HWI-ST1242:77:C13H3ACXX:7:1209:7458:61056	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:1301:15983:83574	Bordetella_bronchiseptica_strain_RB50		Rhodopirellula_baltica_SH_1_complete_genome
HWI-ST1242:77:C13H3ACXX:7:1301:15983:83574	Rhodopirellula_baltica_SH_1_complete_genome	Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1103:12322:58936	Bordetella_bronchiseptica_strain_RB50		Desulfovibrio_piger_ATCC_29098
HWI-ST1242:77:C13H3ACXX:7:1103:12322:58936	Desulfovibrio_piger_ATCC_29098		Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1210:12644:24976	Bacteroides_thetaiotaomicron_VPI-5482		Thermoanaerobacter_pseudethanolicus_ATCC_33223
HWI-ST1242:77:C13H3ACXX:7:1210:12644:24976	Thermoanaerobacter_pseudethanolicus_ATCC_33223	Bacteroides_thetaiotaomicron_VPI-5482
HWI-ST1242:77:C13H3ACXX:7:2108:8801:73269	Sulfurihydrogenibium_yellowstonense_SS-5	SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2108:8801:73269	SulfuriYO3AOP1					Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2108:7167:90955	Salinispora_arenicola_CNS-205			Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2108:7167:90955	Shewanella_baltica_OS223,			Salinispora_arenicola_CNS-205
HWI-ST1242:77:C13H3ACXX:7:2303:16257:47096	Shewanella_baltica_OS185			Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2303:16257:47096	Shewanella_baltica_OS223,			Shewanella_baltica_OS185
HWI-ST1242:77:C13H3ACXX:7:1305:15998:77806	Sulfurihydrogenibium_yellowstonense_SS-5	SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:1305:15998:77806	SulfuriYO3AOP1					Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:1215:18228:5604	Rhodopirellula_baltica_SH_1_complete_genome	Salinispora_arenicola_CNS-205
HWI-ST1242:77:C13H3ACXX:7:1215:18228:5604	Salinispora_arenicola_CNS-205			Rhodopirellula_baltica_SH_1_complete_genome
HWI-ST1242:77:C13H3ACXX:7:2109:5989:11352	Bordetella_bronchiseptica_strain_RB50		Leptothrix_cholodnii_SP-6
HWI-ST1242:77:C13H3ACXX:7:2109:5989:11352	Leptothrix_cholodnii_SP-6			Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	=
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:2202:20027:26498	Treponema_denticola_ATCC_35405		Treponema_vincentii_ATCC_35580_NZACYH00000000.1
HWI-ST1242:77:C13H3ACXX:7:2202:20027:26498	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	Treponema_denticola_ATCC_35405
HWI-ST1242:77:C13H3ACXX:7:2211:7978:77793	Akkermansia_muciniphila_ATCC_BAA-835		Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2211:7978:77793	Bordetella_bronchiseptica_strain_RB50		Akkermansia_muciniphila_ATCC_BAA-835
HWI-ST1242:77:C13H3ACXX:7:2201:12614:25584	Thermotoga_petrophila_RKU-1			Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:2201:12614:25584	Thermotoga_sp._RQ2				Thermotoga_petrophila_RKU-1
...
...

where the 1st column is the read identifier (read identifiers repeat on subsequent rows and may have a multiplicity of 2 or more), and the 2nd and 3rd columns give the linkage information. For example, let's grep the records from the SAM file using one of the identifiers given above:
[uzi@quince-srv2 ~/oneliners]$ grep -i "HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746" *.sam

HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746	65	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	1506973	41	77M24S	Treponema_denticola_ATCC_35405	57994	0	AAAATGTACATCCCTGTACATTTTTTACCTACGTGTTTTCAAGGCAAACACACTACTGATTATAACCACCCCCATCCCCCCCCCAAAACATACATCCCTGT	@CCFFFDDHHHHHJJJJJJJJJJJJJJJIJIJJJD11?0*00*0)0?G;D###################################################	NM:i:4	AS:i:57	XS:i:33
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746	129	Treponema_denticola_ATCC_35405	57994	30	39S56M6S	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	1506973	0	GGCAAAATCAGGTAATCCTTTTGAAATATGCGGGGCATCTACTGCGTCGTTACGCAAAAGCGAATGCTCAACGTACAGCAAGTACGCCTCCGCTTCGCTTT	BCCFFFFFHHHHFHHIIIJJJJJIJJJIJJJJJJJGIGIIGIGIEE;BDDAACA@?BBD>9=8@5?::>AC?588A243(934>C79@89A9309559588	NM:i:3	AS:i:41	XS:i:27	XP:Z:Treponema_vincentii_ATCC_35580_NZACYH00000000.1,-1507188,61S40M,30,0;
HWI-ST1242:77:C13H3ACXX:7:1203:6513:69746	145	Treponema_vincentii_ATCC_35580_NZACYH00000000.1	1507188	30	61S40M	=	1506973	-255	AAAGCGAAGCGGAGGCGTACTTGCTGTACGTTGAGCATTCGCTTTTGCGTAACGACGCAGTAGATGCCCCGCATATTTCAAAAGGATTACCTGATTTTGCC	8859559039A98@97C>439(342A885?CA>::?5@8=9>DBB?@ACAADDB;EEIGIGIIGIGJJJJJJJIJJJIJJJJJIIIHHFHHHHFFFFFCCB	NM:i:0	AS:i:40	XS:i:29	XP:Z:Treponema_denticola_ATCC_35405,+57994,39S56M6S,30,3;
Here, a record is paired-end read 1 if its FLAG is in the set (99,83,67,115,81,97,65,113) and paired-end read 2 if its FLAG is in the set (147,163,131,179,161,145,129,177). So the first record returned by the grep statement is paired-end read 1 and the next two records are paired-end read 2. You can also drop the redundant records where column $7 is "=" (i.e. the mate maps to the same reference) and run the query as follows (it ran in 7.7 seconds):
[uzi@quince-srv2 ~/oneliners]$ awk '$2~/^99$|^147$|^83$|^163$|^67$|^131$|^115$|^179$|^81$|^161$|^97$|^145$|^65$|^129$|^113$|^177$/ && $2!~/^SN/ && $7!~/=/{print $1"\t"$3"\t"$7}' aln-pe.sam | sort -n -k1,1 | uniq | awk 'BEGIN { FS="\t" } { c[$1]++; l[$1,c[$1]]=$0 } END { for (i in c) { if (c[i] > 1) for (j = 1; j <= c[i]; j++) print l[i,j] } }'

HWI-ST1242:77:C13H3ACXX:7:2207:10952:95139	Persephonella_marina_EX-H1		Thermoanaerobacter_pseudethanolicus_ATCC_33223
HWI-ST1242:77:C13H3ACXX:7:2207:10952:95139	Thermoanaerobacter_pseudethanolicus_ATCC_33223	Persephonella_marina_EX-H1
HWI-ST1242:77:C13H3ACXX:7:2216:3177:14124	Shewanella_baltica_OS185		Shewanella_baltica_OS223,
HWI-ST1242:77:C13H3ACXX:7:2216:3177:14124	Shewanella_baltica_OS223,		Shewanella_baltica_OS185
HWI-ST1242:77:C13H3ACXX:7:2205:11785:35383	Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence	Leptothrix_cholodnii_SP-6
HWI-ST1242:77:C13H3ACXX:7:2205:11785:35383	Leptothrix_cholodnii_SP-6		Deinococcus_radiodurans_R1_chromosome_1,_complete_sequence
HWI-ST1242:77:C13H3ACXX:7:2104:19027:61363	Sulfurihydrogenibium_yellowstonense_SS-5	SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2104:19027:61363	SulfuriYO3AOP1				Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:1301:5017:76112	Sulfurihydrogenibium_yellowstonense_SS-5	SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:1301:5017:76112	SulfuriYO3AOP1				Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2316:4885:64809	Sulfurihydrogenibium_yellowstonense_SS-5	SulfuriYO3AOP1
HWI-ST1242:77:C13H3ACXX:7:2316:4885:64809	SulfuriYO3AOP1				Sulfurihydrogenibium_yellowstonense_SS-5
HWI-ST1242:77:C13H3ACXX:7:2305:6004:6780	Acidobacterium_capsulatum_ATCC_51196	Bordetella_bronchiseptica_strain_RB50
HWI-ST1242:77:C13H3ACXX:7:2305:6004:6780	Bordetella_bronchiseptica_strain_RB50	Acidobacterium_capsulatum_ATCC_51196
HWI-ST1242:77:C13H3ACXX:7:2302:12911:73962	Gemmatimonas_aurantiaca_T-27_DNA	gi|83591340|ref|NC_007643.1|
HWI-ST1242:77:C13H3ACXX:7:2302:12911:73962	gi|83591340|ref|NC_007643.1|		Gemmatimonas_aurantiaca_T-27_DNA
HWI-ST1242:77:C13H3ACXX:7:1205:3889:3509	Thermotoga_neapolitana_DSM_4359	Thermotoga_sp._RQ2
HWI-ST1242:77:C13H3ACXX:7:1205:3889:3509	Thermotoga_sp._RQ2			Thermotoga_neapolitana_DSM_4359
The purpose of the final awk statement is to remove singletons, i.e. to get rid of reads where both paired ends map to the same genome/contig. Refer to the SAM format specification for the meaning of the FLAG values used above.
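If you would rather not enumerate the FLAG values, the same read1/read2 distinction can be made by decoding the FLAG bits directly; a rough sketch using GNU awk's and() built-in (bit 1 = paired, bits 4 and 8 = read or mate unmapped, bit 64 = first in pair):
gawk '$1!~/^@/ && and($2,1) && !and($2,4) && !and($2,8) {print $1"\t"(and($2,64)?"read1":"read2")"\t"$3"\t"$7}' aln-pe.sam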

Does subsampling improve assembly quality for very high coverage datasets when using Velvet without removing noise from reads?

I assembled Nanoarchaeum equitans, for which there were 1639910 trimmed forward and reverse reads (trimming was done using Sickle version 1.200 with a minimum window quality score of 20; reads shorter than 10 bp after trimming were removed). This dataset was sequenced on a MiSeq machine using a Nextera library. I used all 1639910 reads in VelvetOptimiser with the n50 score as the optimisation criterion and a kmer range of [21,99]. In the final assembly, the optimum kmer size was 99 and I obtained 13795 contigs with an n50 score of 201, which was NOT ACCEPTABLE. Neil Hall suggested that perhaps the problem is that we have too much coverage, which propagates errors because of the depth of assignments, and that maybe subsampling is the way to go. So I subsampled the reads, taking 10% (163991), 20% (327982), 30% (491973), 40% (655964), 50% (819955), 60% (983946), and 70% (1147937) of the reads respectively, using reservoir sampling.
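For reference, the per-percentage read counts quoted above can be reproduced with a quick awk loop (a sketch, assuming a total of 1639910 reads):
awk -v total=1639910 'BEGIN{for(p=10;p<=70;p+=10) printf "%d%% -> %d reads\n", p, int(total*p/100)}'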

Subsampling 10% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=163991 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p10.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p10.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p10.fastq NE_R2_p10.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 20% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=327982 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p20.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p20.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p20.fastq NE_R2_p20.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 30% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=491973 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p30.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p30.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p30.fastq NE_R2_p30.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 40% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=655964 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p40.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p40.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p40.fastq NE_R2_p40.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 50% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=819955 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p50.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p50.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p50.fastq NE_R2_p50.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 60% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=983946 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p60.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p60.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p60.fastq NE_R2_p60.fastq' -t 10 --optFuncKmer 'n50' &
Subsampling 70% reads:
[uzi@quince-srv2 ~/oneliners]$ paste NE_R1.fastq NE_R2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t");} }' | awk -v k=1147937 'BEGIN{srand(systime() + PROCINFO["pid"]);}{s=x++<k?x-1:int(rand()*x);if(s<k)R[s]=$0}END{for(i in R)print R[i]}' | awk -F"\t" '{print $1"\n"$3"\n"$5"\n"$7 > "NE_R1_p70.fastq";print $2"\n"$4"\n"$6"\n"$8 > "NE_R2_p70.fastq"}'
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/VelvetOptimiser/VelvetOptimiser.pl -s 21 -e 99 -f '-shortPaired -fastq -separate NE_R1_p70.fastq NE_R2_p70.fastq' -t 10 --optFuncKmer 'n50' &
The results are given below:

% of reads	Optimum kmer size	No. of contigs	n50 score	Length of longest contig	No. of contigs > 1k
100%	99	13795	201	1950	28
70%	99	2386	1564	288878	18
60%	99	611	396479	396479	11
50%	95	151	489315	489315	14
40%	95	265	451202	451202	16
30%	95	661	457328	457328	10
20%	99	330	101099	131609	16
10%	95	1277	2263	13314	121

To validate the assembly, we used the paper Read Length and Repeat Resolution: Exploring Prokaryote Genomes Using Next-Generation Sequencing Technologies. Table S1 from the supplementary data is an Excel sheet that gives read length and gap benchmarks for 818 bacterial genomes. From there, for the closely related strain Nanoarchaeum equitans Kin4-M (genome length: 490885), 2 gaps are predicted in the final assembly with 125 bp reads and 0 gaps with 250 bp reads. It can be seen that with 50% of the reads, the longest contig (489315) is 99.68017% of the genome length of Nanoarchaeum equitans Kin4-M. So the verdict is: yes, SUBSAMPLING does improve assembly quality for very high coverage datasets when using Velvet without removing noise from reads.
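The percentage quoted above can be double-checked on the shell with bc, for example:
echo "scale=5; 100*489315/490885" | bc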

Extracting 16S rRNA sequences from NCBI's locally installed nt database using blastdbcmd

For each command given below, the first blastdbcmd invocation prints two fields per sequence (the GI and the title, separated by ";;"), the awk command selects from that output those sequences which have "16S" in their title and prints their GIs, and finally the second blastdbcmd invocation uses those GIs to write out the sequence data for the 16S rRNA sequences from the nt database. Since the nt database is huge, the methods given below use nohup to run the command in the background, so it will continue even if you log off from a remote SSH session.
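If you want to see what the first stage emits before any filtering, you can peek at the intermediate output (the GI and title are separated by ";;" as requested by -outfmt):
/home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry all -outfmt "%g;;%t" | head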

Method 1: Specify the organism names in the command itself:
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry all -outfmt "%g;;%t" | awk -F";;" '/16S/ && /Archaeoglobus fulgidus|Desulfovibrio piger|Herpetosiphon aurantiacus|Ignicoccus hospitalis|Methanocaldococcus jannaschii|Methanococcus maripaludis|Nanoarchaeum equitans|Pyrobaculum aerophilum|Pyrobaculum calidifontis|Pyrococcus horikoshii|Rhodopirellula baltica|Sulfolobus tokodaii/{print $1}' | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch -  -out sequences.fasta &
Method 2: Specify the organism names in a file organisms.txt:
[uzi@quince-srv2 ~/oneliners]$ cat organisms.txt 
Archaeoglobus fulgidus
Desulfovibrio piger
Herpetosiphon aurantiacus
Ignicoccus hospitalis
Methanocaldococcus jannaschii
Methanococcus maripaludis
Nanoarchaeum equitans
Pyrobaculum aerophilum
Pyrobaculum calidifontis
Pyrococcus horikoshii
Rhodopirellula baltica
Sulfolobus tokodaii
[uzi@quince-srv2 ~/oneliners]$ nohup /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry all -outfmt "%g;;%t" | grep -F "$(<organisms.txt)" | awk -F";;" '/16S/{print $1}' | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch -  -out sequences2.fasta &

Now check the header information of the sequences we have pulled out:
[uzi@quince-srv2 ~/oneliners]$ grep -i ">" sequences.fasta
>gi|444303911|ref|NR_074334.1| Archaeoglobus fulgidus DSM 4304 strain DSM 4304 16S ribosomal RNA, complete sequence >gi|38792|emb|X05567.1| Archaeoglobus fulgidus strain VC-16 16S ribosomal RNA gene
>gi|444303911|ref|NR_074334.1| Archaeoglobus fulgidus DSM 4304 strain DSM 4304 16S ribosomal RNA, complete sequence >gi|38792|emb|X05567.1| Archaeoglobus fulgidus strain VC-16 16S ribosomal RNA gene
>gi|38805|emb|Y00275.1| Archaeoglobus fulgidus (VC-16) DNA for 16S ribosomal RNA
>gi|294440|gb|L07510.1|PYBRGDX Pyrobaculum aerophilum 16S ribosomal RNA gene
>gi|1145365|gb|U38484.1|MMU38484 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145366|gb|U38486.1|MMU38486 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145367|gb|U38487.1|MMU38487 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|1145368|gb|U38941.1|MMU38941 Methanococcus maripaludis 16S ribosomal RNA, complete sequence
>gi|2232234|gb|AF005049.1|AF005049 Methanococcus maripaludis 16S ribosomal RNA gene, complete sequence
>gi|2564752|dbj|D45214.1|PYWOT3 Pyrococcus horikoshii 16S rRNA gene, complete sequence
>gi|343201091|ref|NR_041778.1| Desulfovibrio piger ATCC 29098 strain ATCC29098 16S ribosomal RNA, complete sequence >gi|6979515|gb|AF192152.1|AF192152 Desulfomonas pigra 16S ribosomal RNA gene, complete sequence
>gi|18151366|dbj|AB022438.1| Sulfolobus tokodaii genes for 16S rRNA, 23S rRNA, complete and partial sequence
>gi|20429035|emb|AJ318041.1| Nanoarchaeum equitans Kin4-M 16S rRNA gene, strain Kin4-M
>gi|265678650|ref|NR_028955.1| Ignicoccus hospitalis KIN4/I 16S ribosomal RNA, partial sequence >gi|20429036|emb|AJ318042.1| Ignicoccus sp. Kin4-I 16S rRNA gene, strain Kin4-I
>gi|343200235|ref|NR_040922.1| Pyrobaculum calidifontis JCM 11548 strain VA1 16S ribosomal RNA, complete sequence >gi|27807847|dbj|AB078332.1| Pyrobaculum calidifontis gene for 16S rRNA, complete sequence
>gi|343200235|ref|NR_040922.1| Pyrobaculum calidifontis JCM 11548 strain VA1 16S ribosomal RNA, complete sequence >gi|27807847|dbj|AB078332.1| Pyrobaculum calidifontis gene for 16S rRNA, complete sequence
>gi|2913866|dbj|D87344.1| Pyrococcus horikoshii gene for 16S ribosomal RNA, partial sequence
>gi|89077582|gb|DQ374392.1| Archaeoglobus fulgidus strain L3 16S ribosomal RNA gene, partial sequence
>gi|265678749|ref|NR_029054.1| Pyrococcus horikoshii strain JA-1 16S ribosomal RNA, partial sequence >gi|37543825|gb|AY099169.1| Pyrococcus horikoshii strain DSM 12428 16S ribosomal RNA gene, partial sequence
>gi|265678749|ref|NR_029054.1| Pyrococcus horikoshii strain JA-1 16S ribosomal RNA, partial sequence >gi|37543825|gb|AY099169.1| Pyrococcus horikoshii strain DSM 12428 16S ribosomal RNA gene, partial sequence
>gi|37543862|gb|AY099205.1| Pyrococcus horikoshii strain DSM 12428 16S-23S ribosomal RNA intergenic spacer region, complete sequence
>gi|116496446|gb|EF012748.1| Rhodopirellula baltica strain OJF4 16S ribosomal RNA gene, partial sequence
>gi|116496447|gb|EF012749.1| Rhodopirellula baltica strain OJF5 16S ribosomal RNA gene, partial sequence
>gi|116496448|gb|EF012750.1| Rhodopirellula baltica strain OJF6 16S ribosomal RNA gene, partial sequence
>gi|126211523|gb|EF421451.1| Rhodopirellula baltica strain OJF13 16S ribosomal RNA gene, partial sequence
>gi|126211524|gb|EF421452.1| Rhodopirellula baltica strain OJF14 16S ribosomal RNA gene, partial sequence
>gi|126211525|gb|EF421453.1| Rhodopirellula baltica strain OJF15 16S ribosomal RNA gene, partial sequence
>gi|148357800|gb|EF589343.1| Rhodopirellula baltica strain F1 16S ribosomal RNA gene, partial sequence
>gi|148357801|gb|EF589344.1| Rhodopirellula baltica strain Gtu1 16S ribosomal RNA gene, partial sequence
>gi|148357802|gb|EF589345.1| Rhodopirellula baltica strain G1 16S ribosomal RNA gene, partial sequence
>gi|148357804|gb|EF589347.1| Rhodopirellula baltica strain M5 16S ribosomal RNA gene, partial sequence
>gi|148357806|gb|EF589349.1| Rhodopirellula baltica strain Gr16 16S ribosomal RNA gene, partial sequence
>gi|148357811|gb|EF589354.1| Rhodopirellula baltica strain Cc1 16S ribosomal RNA gene, partial sequence
>gi|171879796|gb|EU545801.1| Sulfolobus tokodaii strain TW 16S ribosomal RNA gene, partial sequence
...
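As a quick sanity check, the extracted headers can be tallied per organism with a small grep/awk sketch (it simply counts on the first two words after the GI token, so unusual titles may need manual inspection):
grep ">" sequences.fasta | awk '{print $2, $3}' | sort | uniq -c | sort -rn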

Resolving NCBI Taxonomy using BioSQL

Download BioSQL from the BioSQL website. Once the software is installed, set up a database and import the BioSQL schema. The following command line should create a new database on your own computer called bioseqdb, belonging to the root user account:
[uzi@quince-srv2 ~/oneliners]$ mysqladmin -u root create bioseqdb
We can then tell MySQL to load the BioSQL schema we downloaded above. Change to the sql subdirectory of the unzipped BioSQL download, then:
[uzi@quince-srv2 ~/oneliners/biosql-1.0.1/sql]$ mysql -u root bioseqdb < biosqldb-mysql.sql
To download and load the NCBI taxonomy, change to the scripts subdirectory and use the following command:
[uzi@quince-srv2 ~/oneliners/biosql-1.0.1/scripts]$ ./load_ncbi_taxonomy.pl --dbname bioseqdb --driver mysql --dbuser root --download true
Loading NCBI taxon database in taxdata:
        ... retrieving all taxon nodes in the database
        ... reading in taxon nodes from nodes.dmp
        ... insert / update / delete taxon nodes
        ... (committing nodes)
        ... rebuilding nested set left/right values
        ... reading in taxon names from names.dmp
        ... deleting old taxon names
        ... inserting new taxon names
        ... cleaning up
Done.
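A quick way to confirm the load worked is to count the rows in the taxon and taxon_name tables (a sketch; the exact counts depend on the NCBI taxonomy release you downloaded):
mysql --user=root bioseqdb -e "SELECT COUNT(*) FROM taxon; SELECT COUNT(*) FROM taxon_name;"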
Go to the oneliners folder and check which tables have been created:
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb -e "show tables"
+----------------------------+
| Tables_in_bioseqdb         |
+----------------------------+
| biodatabase                |
| bioentry                   |
| bioentry_dbxref            |
| bioentry_path              |
| bioentry_qualifier_value   |
| bioentry_reference         |
| bioentry_relationship      |
| biosequence                |
| comment                    |
| dbxref                     |
| dbxref_qualifier_value     |
| location                   |
| location_qualifier_value   |
| ontology                   |
| reference                  |
| seqfeature                 |
| seqfeature_dbxref          |
| seqfeature_path            |
| seqfeature_qualifier_value |
| seqfeature_relationship    |
| taxon                      |
| taxon_name                 |
| term                       |
| term_dbxref                |
| term_path                  |
| term_relationship          |
| term_relationship_term     |
| term_synonym               |
+----------------------------+

Check which databases are there:
[uzi@quince-srv2 ~/oneliners]$  mysql --user=root bioseqdb
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| bioseqdb           |
| mysql              |
| test               |
+--------------------+
 
mysql> use bioseqdb
Database changed
Check the attributes of the table "taxon":
mysql> select column_name from information_schema.columns where table_name='taxon';

+-------------------+
| column_name       |
+-------------------+
| taxon_id          |
| ncbi_taxon_id     |
| parent_taxon_id   |
| node_rank         |
| genetic_code      |
| mito_genetic_code |
| left_value        |
| right_value       |
+-------------------+
Check the attributes of the table "taxon_name":
mysql> select column_name from information_schema.columns where table_name='taxon_name'; 

+-------------+
| column_name |
+-------------+
| taxon_id    |
| name        |
| name_class  |
+-------------+
Check if the tables are populated:
mysql> SELECT ncbi_taxon_id from taxon LIMIT 10;

+---------------+
| ncbi_taxon_id |
+---------------+
|             1 |
|             2 |
|             6 |
|             7 |
|             9 |
|            10 |
|            11 |
|            13 |
|            14 |
|            16 |
+---------------+
Now we will run the following commands one by one to find the path from taxon id 54255 to the root node. We do this by finding the parent taxon id of the given id, then using that id to look up its parent, and repeating the procedure until we reach the root node:
mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.ncbi_taxon_id=54255;
+--------------------+-----------+-----------------+
| name               | node_rank | parent_taxon_id |
+--------------------+-----------+-----------------+
| Thermofilum librum | species   |            1793 |
+--------------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1793;
+-------------+-----------+-----------------+
| name        | node_rank | parent_taxon_id |
+-------------+-----------+-----------------+
| Thermofilum | genus     |           86865 |
+-------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=86865;
+----------------+-----------+-----------------+
| name           | node_rank | parent_taxon_id |
+----------------+-----------+-----------------+
| Thermofilaceae | family    |            1791 |
+----------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1791;
+-----------------+-----------+-----------------+
| name            | node_rank | parent_taxon_id |
+-----------------+-----------+-----------------+
| Thermoproteales | order     |          149927 |
+-----------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=149927;
+--------------+-----------+-----------------+
| name         | node_rank | parent_taxon_id |
+--------------+-----------+-----------------+
| Thermoprotei | class     |           12664 |
+--------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=12664;
+---------------+-----------+-----------------+
| name          | node_rank | parent_taxon_id |
+---------------+-----------+-----------------+
| Crenarchaeota | phylum    |            1705 |
+---------------+-----------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=1705;
+---------+--------------+-----------------+
| name    | node_rank    | parent_taxon_id |
+---------+--------------+-----------------+
| Archaea | superkingdom |          102425 |
+---------+--------------+-----------------+

mysql> SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id = taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=102425;

+--------------------+-----------+-----------------+
| name               | node_rank | parent_taxon_id |
+--------------------+-----------+-----------------+
| cellular organisms | no rank   |               1 |
+--------------------+-----------+-----------------+
I have written a small stored procedure to do the same thing as above. Make a new file named path_to_root_node.sql with the following contents:
DROP PROCEDURE IF EXISTS path_to_root_node;
DELIMITER //
CREATE PROCEDURE path_to_root_node(IN leaf INT ) 
BEGIN 
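	-- Walk up the taxonomy: start from the leaf's NCBI taxon id, then repeatedly
	-- look up the scientific name, rank and parent taxon until the root (taxon_id = 1)
	-- is reached, concatenating name:rank pairs into @path_to_root along the way.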
	DECLARE parent_taxon_id INT; 
	DECLARE taxon_name VARCHAR(255); 
	DECLARE taxon_rank VARCHAR(255); 
	DECLARE path_to_root VARCHAR(255); 
	SET @path_to_root = ''; 
	SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id INTO @taxon_name, @taxon_rank, @parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id=taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.ncbi_taxon_id=leaf; 
	SET @path_to_root=CONCAT(@path_to_root,@taxon_name,':',@taxon_rank);
	loop_label: LOOP 
		SELECT taxon_name.name, taxon.node_rank, taxon.parent_taxon_id INTO @taxon_name, @taxon_rank, @parent_taxon_id FROM taxon, taxon_name where taxon.taxon_id=taxon_name.taxon_id AND taxon_name.name_class='scientific name' AND taxon.taxon_id=@parent_taxon_id; 
		SET @path_to_root=CONCAT(@path_to_root,";",@taxon_name,':',@taxon_rank); 
		IF (@parent_taxon_id <> 1) THEN 
			ITERATE loop_label; 
		ELSE 
			LEAVE loop_label; 
		END IF; 
	END LOOP; 
	SELECT @path_to_root; 
END//
DELIMITER ;
Now load the stored procedure and call it; it will give you the complete path from the given taxon id to the root node:
mysql> source path_to_root_node.sql
mysql> call path_to_root_node(54255);

+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| @path_to_root                                                                                                                                                                    |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Thermofilum librum:species;Thermofilum:genus;Thermofilaceae:family;Thermoproteales:order;Thermoprotei:class;Crenarchaeota:phylum;Archaea:superkingdom;cellular organisms:no rank |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
You can also use the following one-liners at the Linux prompt using the above stored procedure:
[uzi@quince-srv2 ~/oneliners]$  mysql --user=root bioseqdb -e "source path_to_root_node.sql;call path_to_root_node(54255);" | awk -F"|" '$0!~/^@path_to_root/{print $0}'
Thermofilum librum:species;Thermofilum:genus;Thermofilaceae:family;Thermoproteales:order;Thermoprotei:class;Crenarchaeota:phylum;Archaea:superkingdom;cellular organisms:no rank
[uzi@quince-srv2 ~/oneliners]$ mysql --user=root bioseqdb -e "source path_to_root_node.sql;call path_to_root_node(9606);" | awk -F"|" '$0!~/^@path_to_root/{print $0}'
Homo sapiens:species;Homo:genus;Homininae:subfamily;Hominidae:family;Hominoidea:superfamily;Catarrhini:parvorder;Simiiformes:infraorder;Haplorrhini:suborder;Primates:order;Euarchontoglires:superorder;Eutheria:no rank;Theria:no rank;Mammalia:class;Amniota:no rank;Tetrapoda:no rank;Sarcopterygii:no rank;Euteleostomi:no rank;Teleostomi:no rank;Gnathostomata:superclass;Vertebrata:no rank;Craniata:subphylum;Chordata:phylum;Deuterostomia:no rank;Bilateria:no rank;Eumetazoa:no rank;Metazoa:kingdom;Opisthokonta:no rank;Eukaryota:superkingdom;cellular organisms:no rank
You can also do other fancy things with the stored procedure. Say you have a list of accession numbers stored in a file accnos.list:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list 
AB000389.1
AB000699.1
AB000700.1
AB000701.1
AB000702.1
AB001518.1
AB001724.1
AB001774.1
AB001775.1
AB001776.1
AB001777.1
AB001779.1
AB001781.1
AB001783.1
AB001784.1
AB001785.1
AB001791.1
AB001793.1
AB001797.1
AB001802.1
You are interested in finding the corresponding GI, taxon ID, and the assignment at species and genus level for each accession number. You can get the GI and taxon ID from the locally installed nt database using blastdbcmd, and resolve the assignment at species and genus level using path_to_root_node.sql. For example, in the following script we get the assignment at species level:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -outfmt "%a,%g,%T" | while IFS=$',' read -r -a myArray; do echo ${myArray[0]},${myArray[1]},${myArray[2]},$(mysql --user=root bioseqdb -e "source path_to_root_node.sql; call path_to_root_node(${myArray[2]});" | awk -F "|" '$0!~/^@path_to_root/{print $0}' | perl -ne '@a=split(/:/,join(/,/,grep {/species/} split(/;/,$_)));print $a[0]'); done
AB000389.1,11036396,81037,Pseudoalteromonas elyakovii
AB000699.1,3107908,153948,Nitrosomonas sp. AL212
AB000700.1,3107909,153949,Nitrosomonas sp. JL21
AB000701.1,3107910,153947,Nitrosomonas sp. GH22
AB000702.1,3107911,42353,Nitrosomonas sp.
AB001518.1,1871429,47467,Ixodes scapularis endosymbiont
AB001724.1,1902830,267859,Microcystis elabens
AB001774.1,1902837,85991,Chlamydia pecorum
AB001775.1,1902838,85991,Chlamydia pecorum
AB001776.1,1902839,85991,Chlamydia pecorum
AB001777.1,1902840,85991,Chlamydia pecorum
AB001779.1,1902842,83554,Chlamydia psittaci
AB001778.1,1902841,331636,Chlamydia psittaci
AB001780.1,1902843,83554,Chlamydia psittaci
AB001781.1,1902844,83554,Chlamydia psittaci
AB001782.1,1902845,83554,Chlamydia psittaci
AB001786.1,1902849,83554,Chlamydia psittaci
AB001787.1,1902850,83554,Chlamydia psittaci
AB001788.1,1902851,83554,Chlamydia psittaci
AB001789.1,1902852,83554,Chlamydia psittaci
AB001790.1,1902853,83554,Chlamydia psittaci
AB001812.1,1902875,83554,Chlamydia psittaci
AB001783.1,1902846,83555,Chlamydophila abortus
AB001784.1,1902847,83554,Chlamydia psittaci
AB001785.1,1902848,83556,Chlamydophila felis
AB001791.1,1902854,83554,Chlamydia psittaci
AB001793.1,1902856,83554,Chlamydia psittaci
AB001794.1,1902857,83554,Chlamydia psittaci
AB001800.1,1902863,83554,Chlamydia psittaci
AB001801.1,1902864,83554,Chlamydia psittaci
AB001803.1,1902866,83554,Chlamydia psittaci
AB001796.1,1902859,83554,Chlamydia psittaci
AB001797.1,1902860,83554,Chlamydia psittaci
AB001799.1,1902862,83554,Chlamydia psittaci
AB001802.1,1902865,83554,Chlamydia psittaci
Using "genus" as search pattern in grep command below gives assignment at genus level:
[uzi@quince-srv2 ~/oneliners]$ cat accnos.list | /home/opt/ncbi-blast-2.2.28+/bin/blastdbcmd -db /home/opt/ncbi-blast-2.2.28+/db/nt -entry_batch - -outfmt "%a,%g,%T" | while IFS=$',' read -r -a myArray; do echo ${myArray[0]},${myArray[1]},${myArray[2]},$(mysql --user=root bioseqdb -e "source path_to_root_node.sql; call path_to_root_node(${myArray[2]});" | awk -F "|" '$0!~/^@path_to_root/{print $0}' | perl -ne '@a=split(/:/,join(/,/,grep {/genus/} split(/;/,$_)));print $a[0]'); done
AB000389.1,11036396,81037,Pseudoalteromonas
AB000699.1,3107908,153948,Nitrosomonas
AB000700.1,3107909,153949,Nitrosomonas
AB000701.1,3107910,153947,Nitrosomonas
AB000702.1,3107911,42353,Nitrosomonas
AB001518.1,1871429,47467,
AB001724.1,1902830,267859,Microcystis
AB001774.1,1902837,85991,Chlamydia
AB001775.1,1902838,85991,Chlamydia
AB001776.1,1902839,85991,Chlamydia
AB001777.1,1902840,85991,Chlamydia
AB001779.1,1902842,83554,Chlamydia
AB001778.1,1902841,331636,Chlamydia
AB001780.1,1902843,83554,Chlamydia
AB001781.1,1902844,83554,Chlamydia
AB001782.1,1902845,83554,Chlamydia
AB001786.1,1902849,83554,Chlamydia
AB001787.1,1902850,83554,Chlamydia
AB001788.1,1902851,83554,Chlamydia
AB001789.1,1902852,83554,Chlamydia
AB001790.1,1902853,83554,Chlamydia
AB001812.1,1902875,83554,Chlamydia
AB001783.1,1902846,83555,Chlamydophila
AB001784.1,1902847,83554,Chlamydia
AB001785.1,1902848,83556,Chlamydophila
AB001791.1,1902854,83554,Chlamydia
AB001793.1,1902856,83554,Chlamydia
AB001794.1,1902857,83554,Chlamydia
AB001800.1,1902863,83554,Chlamydia
AB001801.1,1902864,83554,Chlamydia
AB001803.1,1902866,83554,Chlamydia
AB001796.1,1902859,83554,Chlamydia
AB001797.1,1902860,83554,Chlamydia
AB001799.1,1902862,83554,Chlamydia
AB001802.1,1902865,83554,Chlamydia

Last Updated by Dr Umer Zeeshan Ijaz on 13/04/2014.