Extracting subset of records from FASTA/FASTQ files based on exact/pattern matches of IDs (ONE-LINERS)

by Umer Zeeshan Ijaz

To extract a subset of records from FASTA/FASTQ files based on the IDs provided in IDs.txt, use the following one-liners:

FASTA file:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta

FASTQ file:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1' | pcregrep -oM -f - file.fastq

Deconstructing the code:

We are using pcregrep, which searches for character patterns in the same way as other grep commands do, but uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. The advantage is the -M (--multiline) switch, which allows patterns to match more than one line. With this option, patterns may contain literal newline characters and internal occurrences of ^ and $ characters.

However, there is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. pcregrep ensures that at least 8K characters (20K in newer versions) or the rest of the document (whichever is the shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for look-behind assertions. This 8K or 20K buffer size limit is sufficient for pulling out shorter reads. For larger contigs produced by assembly software, however, a newer version of pcregrep (tested on version 8.35) should be installed and the buffer size increased with the --buffer-size switch (as can be seen later).

With the -o switch, we only show the part of the line that matches the pattern. The -f filename (--file=filename) switch reads patterns from a file, one per line, and matches each of them against the input. To read the patterns from STDIN, we give it as -f -.

(?=pattern) Positive + Look-ahead

$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '.*\-(?=A5H4M)'

@MSQ-M01442:38:000000000-

The above matches .*\- followed by A5H4M, without including A5H4M in $& (Perl's name for the matched text).

(?!pattern) Negative + Look-ahead

Say we want to capture everything up to and including first hyphen in the echoed string

$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '.*\-(?!A5H4M)'

@MSQ-

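The above will match .*\- that isn't followed by A5H4M. Since .* is greedy, it first tries the last hyphen, which is followed by A5H4M, and then backtracks to the first hyphen.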

(?<=pattern) or \K Positive + Look-behind

$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '(?<=A5H4M:)1:.*'

1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA

$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po 'A5H4M:\K1:.*'

1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA

Both commands match 1:.* following A5H4M:, without including A5H4M: in $&.

(?<!pattern) Negative + Look-behind

Say we want the next match for 1:.* (after the previous one)

$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '(?<!A5H4M:)1:.*'

1:10737:16051 1:N:0:CGAGGCTGAAGGAGTA

The above will match 1:.* that doesn't follow A5H4M:.

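The look-ahead and look-behind assertions above can also be combined in a single pattern. As a small sketch (the variable names header and tile are only illustrative, and the fixed prefix A5H4M:1: is specific to this example header), the following pulls out the tile number that sits between A5H4M:1: and the next colon:

```shell
# Example header taken from the text above.
header="@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA"

# (?<=A5H4M:1:) requires the prefix before the match and (?=:) requires a colon after it;
# neither of them is included in the text printed by -o.
tile=$(echo "$header" | grep -Po '(?<=A5H4M:1:)\d+(?=:)')
echo "$tile"
```

For this header the command prints 1101.
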
Now let us see what the code does for a given FASTA file:

$ cat f1.fasta

>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG

It should be immediately clear that the FASTA file is not linearised, i.e., the sequences may span multiple lines. Now let us extract some FASTA IDs:

$ grep -Po '(?<=^>).*' f1.fasta | sort -R | head -2 > IDs.txt
$ cat IDs.txt

F1
F5_C2

Here, -P enables PCRE in grep and -o prints only the matched part, sort -R shuffles the IDs (so the selection will differ from run to run), and head -2 keeps the first two. Now run the awk portion of the one-liner for FASTA:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1'

(?s)^>F1.*?(?=\n(\z|>))
(?s)^>F5\_C2.*?(?=\n(\z|>))

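Here,

gsub("_","\\_",$0) puts a backslash in front of every underscore in the ID, and the trailing 1 makes awk print the rewritten line;

(?s) activates PCRE_DOTALL, so that . also matches newline characters and the pattern can run across the lines of a record;

^>ID anchors the match at a header line that starts with the given ID;

.*? matches lazily, i.e., as few characters as possible, so the match stops at the first place where the look-ahead succeeds;

(?=\n(\z|>)) is a positive look-ahead that requires the match to be followed by a newline and then either the end of the input (\z) or the > of the next header, without including them in the output.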

Using the IDs in IDs.txt, we can then grep the records as follows:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - f1.fasta

>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG

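When pcregrep is not available, or when the records are too long for its buffer, a plain awk filter is one possible alternative for exact ID matches. The following is only a sketch, not the method used above: the function name fasta_subset and the toy files are made up for illustration, and it assumes that each line of the ID list equals the complete header line without the leading >.

```shell
# Hypothetical helper: print the FASTA records whose header (minus ">") is listed in the ID file.
# Usage: fasta_subset IDs.txt file.fasta
fasta_subset() {
    awk '
        NR == FNR { want[$0] = 1; next }           # first file: remember every ID
        /^>/ { keep = (substr($0, 2) in want) }    # header line: decide whether to keep the record
        keep                                       # print header and sequence lines of kept records
    ' "$1" "$2"
}

# Toy input (short made-up sequences) to try the function on.
tmpdir=$(mktemp -d)
printf '>F1\nACGT\n>F2\nGGCC\nTTAA\n>F5_C2\nAAAA\nCCCC\n' > "$tmpdir/toy.fasta"
printf 'F1\nF5_C2\n' > "$tmpdir/toy_ids.txt"

subset=$(fasta_subset "$tmpdir/toy_ids.txt" "$tmpdir/toy.fasta")
printf '%s\n' "$subset"
rm -r "$tmpdir"
```

Unlike the pcregrep pattern, which matches any header that starts with the given ID, this compares the whole header line, so an ID such as F1 would not also pick up a header like >F10.
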
We can also construct an alias FASTAgrep and use it instead of the longer one-liner:

$ alias FASTAgrep="awk '{gsub(\"_\",\"\\\_\",\$0);\$0=\"(?s)^>\"\$0\".*?(?=\\\n(\\\z|>))\"}1' | pcregrep -oM -f -"
$ cat IDs.txt | FASTAgrep f1.fasta

>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG

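FASTAgrep is then the building block for returning FASTA records based on a provided pattern for the FASTA header. Because whatever we pipe in is placed directly after ^> in the final pattern, we can also embed PCRE syntax in the patterns themselves. Say we want to return the records whose IDs contain _C. In f1.fasta, every such ID has exactly two characters before _C, so we can use: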

$ echo ".._C" | FASTAgrep f1.fasta

>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG

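or, if we want to return the records whose IDs do not contain _C, we can ask for a third character that is not an underscore (this again relies on the two-character prefix of the IDs in f1.fasta):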

$ echo "..[^_]" | FASTAgrep f1.fasta

>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC

An obvious use of FASTAgrep is to filter contigs generated by assembly software such as Velvet and MetaVelvet, which store length and coverage information for each contig in its header. For example, the first few headers of a contigs.fa from Velvet are as follows:

$ grep -i ">" contigs.fa | head -20

>NODE_1_length_205_cov_2.678049
>NODE_2_length_221_cov_3.239819
>NODE_3_length_3390_cov_20.385250
>NODE_4_length_164_cov_2.597561
>NODE_5_length_165_cov_4.327273
>NODE_6_length_5675_cov_18.628546
>NODE_7_length_167_cov_3.526946
>NODE_8_length_197_cov_2.903553
>NODE_9_length_141_cov_3.262411
>NODE_10_length_175_cov_3.777143
>NODE_11_length_1365_cov_14.082784
>NODE_12_length_1944_cov_7.141975
>NODE_13_length_1418_cov_18.503527
>NODE_14_length_423_cov_6.451537
>NODE_15_length_143_cov_2.111888
>NODE_16_length_445_cov_20.523596
>NODE_17_length_5129_cov_24.638330
>NODE_18_length_2905_cov_18.701204
>NODE_19_length_323_cov_3.560371
>NODE_20_length_6239_cov_17.427153

As an example, to filter out contigs with a length below 1000 (that is, to keep those whose length has at least four digits), we can use the following:

$ echo "NODE_\d+_length_(\d){4,}_" | FASTAgrep --buffer-size=100000000 contigs.fa | grep -i ">" | head -20

>NODE_3_length_3390_cov_20.385250
>NODE_6_length_5675_cov_18.628546
>NODE_11_length_1365_cov_14.082784
>NODE_12_length_1944_cov_7.141975
>NODE_13_length_1418_cov_18.503527
>NODE_17_length_5129_cov_24.638330
>NODE_18_length_2905_cov_18.701204
>NODE_20_length_6239_cov_17.427153
>NODE_22_length_4091_cov_19.720362
>NODE_24_length_4513_cov_14.317084
>NODE_26_length_9610_cov_19.642977
>NODE_28_length_2442_cov_21.612612
>NODE_30_length_8253_cov_18.978554
>NODE_31_length_15435_cov_19.081308
>NODE_33_length_1584_cov_22.407827
>NODE_34_length_1911_cov_15.783360
>NODE_35_length_1834_cov_16.521811
>NODE_37_length_4253_cov_18.201740
>NODE_40_length_3255_cov_17.576958
>NODE_41_length_6098_cov_17.564283

or, to filter out contigs with coverage less than 20 (that is, to keep those whose coverage has an integer part of 20 or more), we can use the following:

$ echo "NODE_\d+_length_\d+_cov_((\d){3,}|[2-9]\d)\." | FASTAgrep --buffer-size=100000000 contigs.fa | grep -i ">" | head -20

>NODE_3_length_3390_cov_20.385250
>NODE_16_length_445_cov_20.523596
>NODE_17_length_5129_cov_24.638330
>NODE_27_length_777_cov_30.042471
>NODE_28_length_2442_cov_21.612612
>NODE_33_length_1584_cov_22.407827
>NODE_60_length_15251_cov_67.381416
>NODE_65_length_1170_cov_21.969231
>NODE_86_length_8404_cov_74.705971
>NODE_98_length_8355_cov_23.382885
>NODE_111_length_910_cov_21.128571
>NODE_125_length_3085_cov_22.705349
>NODE_153_length_8307_cov_81.449623
>NODE_161_length_530_cov_46.290565
>NODE_162_length_443_cov_31.598194
>NODE_163_length_2228_cov_21.039946
>NODE_166_length_2736_cov_20.957237
>NODE_175_length_7338_cov_22.985283
>NODE_179_length_831_cov_26.533092
>NODE_180_length_3634_cov_20.444963

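Writing numeric thresholds as digit patterns becomes awkward for cut-offs such as 1500 or 12.5. As an alternative sketch (not part of the one-liners above), awk can split a Velvet-style header on underscores and compare the length and coverage as numbers. The function name velvet_filter and the toy sequences are hypothetical; the header layout >NODE_<n>_length_<len>_cov_<cov> is taken from the listing above.

```shell
# Hypothetical helper: keep records with length >= $1 and coverage >= $2.
# Usage: velvet_filter MIN_LENGTH MIN_COVERAGE contigs.fa
velvet_filter() {
    awk -F_ -v minlen="$1" -v mincov="$2" '
        # With "_" as separator, a header has the length in field 4 and the coverage in field 6.
        /^>/ { keep = ($4 + 0 >= minlen + 0 && $6 + 0 >= mincov + 0) }
        keep
    ' "$3"
}

# Toy file: headers copied from the listing above, sequences made up and shortened.
tmpdir=$(mktemp -d)
printf '>NODE_1_length_205_cov_2.678049\nACGT\n' > "$tmpdir/toy_contigs.fa"
printf '>NODE_3_length_3390_cov_20.385250\nGGCC\nTTAA\n' >> "$tmpdir/toy_contigs.fa"
printf '>NODE_6_length_5675_cov_18.628546\nCCGG\n' >> "$tmpdir/toy_contigs.fa"
printf '>NODE_16_length_445_cov_20.523596\nAATT\n' >> "$tmpdir/toy_contigs.fa"

long_contigs=$(velvet_filter 1000 0 "$tmpdir/toy_contigs.fa")
long_and_covered=$(velvet_filter 1000 20 "$tmpdir/toy_contigs.fa")
printf '%s\n' "$long_and_covered"
rm -r "$tmpdir"
```

Since the comparison is numeric, the same function handles any threshold without rewriting a pattern, and it reads the file line by line, so no buffer size needs to be set.
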
Now, let us suppose that we have f1.fastq as follows:

$ cat f1.fastq

@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..<AC@-.C0<<0DD0<<0<<;C.GGHGE::;.C.C9AA?G?GFBFGGBBA9-A>BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
@MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
+
CCCCDFFFFCBCGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHGGGGGGGGGGGGGHGGHGHHGGGGGHHHGGHGGGHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGHHHHGGGHGHHHHHHHHHHHHGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDFFFFFFFFFFFFFHHFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2104:20735:20554 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTCAGGGAAGAAATCCTTTGAGTTAATACCTCGGAGGGATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
+
ABBAAD@FFABBCEECCGGGGGEEFFHHFCHFFHFHFHFGEGGGFAGHHFHGHHFGHHHFFGGGEEGHGGHFGHHHHHHHFHHEEFFGGHFFHEEGGDGHHHHHHBCFHDHHFGGHGHHHGFFFFHHHHHGHGGGCC?FHGHHGCGHGHGHHHHHHHHGHHHHECCGGGGGGGGGGGGGGGGGGGGG@GBFFFFFFFF9FDFBEDFFFFFFFBFFFFFFFFFFFFFFFF?BBFEFFB;DF<BBBBFBB9@
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF

and a list of IDs as follows:

$ cat IDs.txt

MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA

Designing the PCRE for FASTQ records is much simpler: since we know that each FASTQ record spans 4 lines, there is no need to activate PCRE_DOTALL using (?s) or to use a positive look-ahead assertion. Now run the awk portion of the one-liner for FASTQ:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1'

^@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA.*?(\n.*){3}
^@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA.*?(\n.*){3}

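Here,

^@ID anchors the match at a header line that starts with @ followed by the given ID;

.*? lazily matches whatever remains of the header line (without (?s), . does not match a newline);

(\n.*){3} matches the next three lines of the record, i.e., the sequence, the + separator, and the quality scores.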

Using the IDs in IDs.txt, we can grep the records as follows:

$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1' | pcregrep -oM -f - f1.fastq

@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF

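The fixed four-line layout also makes an awk-only version straightforward. This is a sketch under the same assumption as the pattern above (exactly four lines per record, no wrapped sequences); fastq_subset and the toy reads are made-up names and data, and each line of the ID list must equal the complete header line without the leading @.

```shell
# Hypothetical helper: print the four-line FASTQ records whose header (minus "@") is listed.
# Usage: fastq_subset IDs.txt file.fastq
fastq_subset() {
    awk '
        NR == FNR { want["@" $0] = 1; next }     # first file: remember every ID with "@" prepended
        FNR % 4 == 1 { keep = ($0 in want) }     # line 1 of each record is the header
        keep                                     # print all four lines of kept records
    ' "$1" "$2"
}

# Toy input with made-up reads; the second quality string starts with "@" on purpose.
tmpdir=$(mktemp -d)
printf '@read1 1:N:0\nACGT\n+\nIIII\n@read2 1:N:0\nGGCC\n+\n@III\n@read3 1:N:0\nTTAA\n+\nHHHH\n' > "$tmpdir/toy.fastq"
printf 'read1 1:N:0\nread3 1:N:0\n' > "$tmpdir/toy_ids.txt"

picked=$(fastq_subset "$tmpdir/toy_ids.txt" "$tmpdir/toy.fastq")
printf '%s\n' "$picked"
rm -r "$tmpdir"
```

Because the header is identified by its position (every fourth line) rather than by a leading @, a quality line that happens to start with @ cannot be mistaken for a header.
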
We can similarly make an alias as before:

$ alias FASTQgrep="awk '{gsub(\"_\",\"\\\_\",\$0);\$0=\"^@\"\$0\".*?(\\\n.*){3}\"}1' | pcregrep -oM -f -"
$ cat IDs.txt | FASTQgrep f1.fastq

@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF

If we want to match those MiSeq reads whose IDs come from tile 1102 or 1108, we can use the following:

$ echo "MSQ-M01442:38:000000000-A5H4M:1:110(2|8)" | FASTQgrep f1.fastq

@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..<AC@-.C0<<0DD0<<0<<;C.GGHGE::;.C.C9AA?G?GFBFGGBBA9-A>BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
@MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
+
CCCCDFFFFCBCGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHGGGGGGGGGGGGGHGGHGHHGGGGGHHHGGHGGGHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGHHHHGGGHGHHHHHHHHHHHHGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDFFFFFFFFFFFFFHHFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

Last Updated by Dr Umer Zeeshan Ijaz on 16/04/2014.