To extract a subset of records from FASTA/FASTQ files based on IDs provided in IDs.txt, use the following one-liners:
FASTA file:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - file.fasta
FASTQ file:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1' | pcregrep -oM -f - file.fastq
Deconstructing the code:
We are using pcregrep, which searches for character patterns in the same way as other grep commands do, but uses the PCRE regular expression library to support patterns that are compatible with the regular expressions of Perl 5. The advantage is the -M, --multiline switch, which allows patterns to match more than one line. With this option, patterns may contain literal newline characters and internal occurrences of ^ and $ characters. However, there is a limit to the number of lines that can be matched, imposed by the way that pcregrep buffers the input file as it scans it. pcregrep ensures that at least 8K characters (20K in newer versions) or the rest of the document (whichever is shorter) are available for forward matching, and similarly the previous 8K characters (or all the previous characters, if fewer than 8K) are guaranteed to be available for look-behind assertions. This 8K or 20K buffer is sufficient for pulling out short reads. For larger contigs produced by assembly software, however, a newer version of pcregrep (tested on version 8.35) should be installed and the buffer size adjusted with the --buffer-size switch (as can be seen later). With the -o switch, only the part of the line that matches the pattern is shown. -f filename, --file=filename reads patterns from the file, one per line, and matches them against each line of the input. To read the patterns from STDIN, we pass it as -f -.
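Because the buffer limits how much text a multi-line match can span, it can help to know how large the biggest record is before choosing a value for --buffer-size. The following is a minimal sketch and not part of the original one-liners: max_record_bytes is a hypothetical helper name, and it assumes, as a rough guide only, that a record should fit within the buffer to be matched in full.

```shell
# Hypothetical helper: print the size in characters (bytes for plain ASCII
# sequence data) of the largest record in a FASTA file, counting the header,
# the sequence lines and their newlines.
max_record_bytes() {
  awk '
    /^>/ { if (n > max) max = n; n = 0 }   # new record: close the previous one
         { n += length($0) + 1 }           # +1 accounts for the newline
    END  { if (n > max) max = n; print max + 0 }
  ' "$1"
}

# Small self-contained demonstration on a throwaway file.
tmp=$(mktemp)
printf '>a\nACGT\n>b\nACGTACGT\nAC\n' > "$tmp"
max_record_bytes "$tmp"   # prints 15 (record b: 3 + 9 + 3)
rm -f "$tmp"
```

The printed value can then serve as a guide when picking a number for --buffer-size; choosing something comfortably larger is probably the safer option.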
(?=pattern): positive look-ahead
$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '.*\-(?=A5H4M)'
@MSQ-M01442:38:000000000-
The above matches .*\- followed by A5H4M, without including A5H4M in $& (the matched text).
(?!pattern): negative look-ahead
Say we want to capture everything up to and including the first hyphen in the echoed string:
$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '.*\-(?!A5H4M)'
@MSQ-
The above will match .*\- that isn't followed by A5H4M.
(?<=pattern) or \K: positive look-behind
$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '(?<=A5H4M:)1:.*'
1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA
$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po 'A5H4M:\K1:.*'
1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA
The above will match 1:.* following A5H4M:, without including A5H4M: in $&.
(?<!pattern): negative look-behind
Say we want the next match for 1:.* (after the previous one):
$ echo "@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA" | grep -Po '(?<!A5H4M:)1:.*'
1:10737:16051 1:N:0:CGAGGCTGAAGGAGTA
The above will match 1:.* that doesn't follow A5H4M:.
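The look-arounds can also be combined. As a small illustration (the variable names header, tile and tile_k are only for this sketch, and GNU grep built with PCRE support is assumed), the following pulls the tile number out of the read header used above by requiring A5H4M:1: before the match and a colon after it:

```shell
header='@MSQ-M01442:38:000000000-A5H4M:1:1101:10737:16051 1:N:0:CGAGGCTGAAGGAGTA'

# The look-behind fixes the left context and the look-ahead the right context;
# neither is included in the reported match.
tile=$(printf '%s\n' "$header" | grep -Po '(?<=A5H4M:1:)\d+(?=:)')
echo "$tile"     # prints 1101

# The same result with \K, which discards everything matched so far.
tile_k=$(printf '%s\n' "$header" | grep -Po 'A5H4M:1:\K\d+')
echo "$tile_k"   # prints 1101
```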
Now let us see what the code does for a given FASTA file:
$ cat f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
It should be immediately clear that the FASTA file is not linearised, i.e., the sequences may span multiple lines. Now let us extract some FASTA IDs:
$ grep -Po '(?<=^>).*' f1.fasta | sort -R | head -2 > IDs.txt
$ cat IDs.txt
F1
F5_C2
Here, -P enables PCRE in grep, sort -R shuffles the headers, and head -2 keeps the first two IDs. Now run the awk portion of the one-liner for FASTA:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1'
(?s)^>F1.*?(?=\n(\z|>))
(?s)^>F5\_C2.*?(?=\n(\z|>))
Here,
awk generates the PCRE that will later be used in pcregrep to extract the records from the FASTA file
gsub("_","\\_",$0) in awk replaces _ by \_ so that the underscore is taken literally (in PCRE, \_ simply matches _, so the backslash is a precaution)
$0="(?s)^>"$0".*?(?=\\n(\\z|>))" encapsulates the IDs within regular expressions
(?s) activates PCRE_DOTALL, which means that . matches any character, including newline (you can use \N, which matches anything except newline, even with PCRE_DOTALL activated)
^> searches for records starting with >
.*? matches . in non-greedy mode, i.e., stops as soon as possible
(?=\\n(\\z|>)) is our positive look-ahead, which stops the match as soon as the next > is encountered or at EOF, \z (for the case when the ID is for the last record in the FASTA file)
(\\z|>) will match either \z or >
\\ is used because we are constructing the PCRE inside awk string literals, where \\ yields a single \
Using the IDs in IDs.txt, we can then grep the records as follows:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1' | pcregrep -oM -f - f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
We can also construct an alias FASTAgrep and use it instead of the longer one-liner:
$ alias FASTAgrep="awk '{gsub(\"_\",\"\\\_\",\$0);\$0=\"(?s)^>\"\$0\".*?(?=\\\n(\\\z|>))\"}1' | pcregrep -oM -f -"
$ cat IDs.txt | FASTAgrep f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
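Aliases are normally expanded only in interactive shells, so the alias above will not work as-is inside a bash script. One possible alternative, sketched here under the assumption that bash or another POSIX shell is used, is a pair of shell functions. The names fasta_patterns and fasta_grep are hypothetical and chosen only for this example; the awk program is the same one used above.

```shell
# Turn each ID read from STDIN into the PCRE used in the one-liner.
fasta_patterns() {
  awk '{gsub("_","\\_",$0);$0="(?s)^>"$0".*?(?=\\n(\\z|>))"}1'
}

# Intended to behave like the FASTAgrep alias; all arguments (options and
# file names) are passed through to pcregrep unchanged.
fasta_grep() {
  fasta_patterns | pcregrep -oM -f - "$@"
}

# Only the pattern-building step is run here, so pcregrep is not required.
printf 'F1\nF5_C2\n' | fasta_patterns
```

With these definitions, cat IDs.txt | fasta_grep f1.fasta should give the same records as the alias, and extra options such as --buffer-size can be placed before the file name.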
FASTAgrep is then the building block for returning FASTA records based on a provided pattern for the FASTA header. We can also embed PCRE in the patterns themselves.
Say we want to return the records for which IDs contain _C:
$ echo ".._C" | FASTAgrep f1.fasta
>F4_C1
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
>F5_C2
GTGGCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
AAGTCGGGGGTTAAATCCATGTGCTTAACACATGCAAGGCTTCCGATACTGTAGGTCTAGAGTCTCGAAGTTCCGGTGTA
ACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGAACTGACGCTCAGGCACGAA
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
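Or, if we want to return the records for which IDs do not contain _C (the pattern below only checks that the first two characters of the ID are not followed by _, which is sufficient for this file):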
$ echo "..[^_]" | FASTAgrep f1.fasta
>F1
GTGTCAGCCGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGACTTAT
>F2
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
>F3
GTGGCAGCCGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGAATGATTGGGCGTAAAGCGTCCGCAGGTGGCACTGT
AAGTCTGCTGTTAAAGAGCAAGGCTCAACCTTGTAAAGGCAGTGGAAACTACAGAGCTAGAGTACGTTCGTGGTGTAGCG
GTGAAATGCGTAGAGATCAGGAAGAACACCGGTGGCGAAAGCGCTCTGCTAGGCCGTAACTGACACTGAGGGACGAAAGC
An obvious advantage of FASTAgrep is filtering contigs generated by assembly software such as Velvet and MetaVelvet, which store length and coverage information for each contig in its header. For example, the first few headers of a given contigs.fa from Velvet are as follows:
$ grep -i ">" contigs.fa | head -20
>NODE_1_length_205_cov_2.678049
>NODE_2_length_221_cov_3.239819
>NODE_3_length_3390_cov_20.385250
>NODE_4_length_164_cov_2.597561
>NODE_5_length_165_cov_4.327273
>NODE_6_length_5675_cov_18.628546
>NODE_7_length_167_cov_3.526946
>NODE_8_length_197_cov_2.903553
>NODE_9_length_141_cov_3.262411
>NODE_10_length_175_cov_3.777143
>NODE_11_length_1365_cov_14.082784
>NODE_12_length_1944_cov_7.141975
>NODE_13_length_1418_cov_18.503527
>NODE_14_length_423_cov_6.451537
>NODE_15_length_143_cov_2.111888
>NODE_16_length_445_cov_20.523596
>NODE_17_length_5129_cov_24.638330
>NODE_18_length_2905_cov_18.701204
>NODE_19_length_323_cov_3.560371
>NODE_20_length_6239_cov_17.427153
As an example, to keep only contigs with a length of at least 1000 (i.e., to filter out the shorter ones), we can use the following:
$ echo "NODE_\d+_length_(\d){4,}_" | FASTAgrep --buffer-size=100000000 contigs.fa | grep -i ">" | head -20
>NODE_3_length_3390_cov_20.385250
>NODE_6_length_5675_cov_18.628546
>NODE_11_length_1365_cov_14.082784
>NODE_12_length_1944_cov_7.141975
>NODE_13_length_1418_cov_18.503527
>NODE_17_length_5129_cov_24.638330
>NODE_18_length_2905_cov_18.701204
>NODE_20_length_6239_cov_17.427153
>NODE_22_length_4091_cov_19.720362
>NODE_24_length_4513_cov_14.317084
>NODE_26_length_9610_cov_19.642977
>NODE_28_length_2442_cov_21.612612
>NODE_30_length_8253_cov_18.978554
>NODE_31_length_15435_cov_19.081308
>NODE_33_length_1584_cov_22.407827
>NODE_34_length_1911_cov_15.783360
>NODE_35_length_1834_cov_16.521811
>NODE_37_length_4253_cov_18.201740
>NODE_40_length_3255_cov_17.576958
>NODE_41_length_6098_cov_17.564283
Or, to keep only contigs with a coverage of at least 20 (i.e., to filter out those with coverage less than 20), we can use the following:
$ echo "NODE_\d+_length_\d+_cov_((\d){3,}|[2-9]\d)\." | FASTAgrep --buffer-size=100000000 contigs.fa | grep -i ">" | head -20
>NODE_3_length_3390_cov_20.385250
>NODE_16_length_445_cov_20.523596
>NODE_17_length_5129_cov_24.638330
>NODE_27_length_777_cov_30.042471
>NODE_28_length_2442_cov_21.612612
>NODE_33_length_1584_cov_22.407827
>NODE_60_length_15251_cov_67.381416
>NODE_65_length_1170_cov_21.969231
>NODE_86_length_8404_cov_74.705971
>NODE_98_length_8355_cov_23.382885
>NODE_111_length_910_cov_21.128571
>NODE_125_length_3085_cov_22.705349
>NODE_153_length_8307_cov_81.449623
>NODE_161_length_530_cov_46.290565
>NODE_162_length_443_cov_31.598194
>NODE_163_length_2228_cov_21.039946
>NODE_166_length_2736_cov_20.957237
>NODE_175_length_7338_cov_22.985283
>NODE_179_length_831_cov_26.533092
>NODE_180_length_3634_cov_20.444963
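Counting digits works for round thresholds such as 1000 or 20, but an arbitrary cut-off (say, a length of at least 1500) is awkward to express as a regular expression. A possible alternative, shown here only as a sketch, is to let awk compare the numbers and print the IDs that pass. The names select_contig_ids, min_len and min_cov are hypothetical, and the header layout NODE_<n>_length_<len>_cov_<cov> shown above is assumed.

```shell
# Print the IDs (without ">") of contigs whose header reports
# length >= $1 and coverage >= $2. Headers are read from STDIN.
select_contig_ids() {
  awk -F_ -v min_len="$1" -v min_cov="$2" '
    # With "_" as separator, field 4 is the length and field 6 the coverage.
    /^>/ && ($4 + 0) >= min_len && ($6 + 0) >= min_cov { print substr($0, 2) }
  '
}

# Demonstration on four of the headers listed above.
printf '%s\n' \
  '>NODE_1_length_205_cov_2.678049' \
  '>NODE_3_length_3390_cov_20.385250' \
  '>NODE_12_length_1944_cov_7.141975' \
  '>NODE_16_length_445_cov_20.523596' |
  select_contig_ids 1500 20   # prints NODE_3_length_3390_cov_20.385250
```

The selected IDs could then be piped into FASTAgrep in place of a hand-written pattern. The dot in the coverage value acts as a regex wildcard there, which still matches the literal dot.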
Now, let us suppose that we have f1.fastq as follows:
$ cat f1.fastq
@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..<AC@-.C0<<0DD0<<0<<;C.GGHGE::;.C.C9AA?G?GFBFGGBBA9-A>BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
@MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
+
CCCCDFFFFCBCGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHGGGGGGGGGGGGGHGGHGHHGGGGGHHHGGHGGGHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGHHHHGGGHGHHHHHHHHHHHHGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDFFFFFFFFFFFFFHHFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2104:20735:20554 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGGGAAGAAGGCCTTCGGGTTGTAAACCGCTTTTGTCAGGGAAGAAATCCTTTGAGTTAATACCTCGGAGGGATGACGGTACCTGAAGAATAAGCACCGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTT
+
ABBAAD@FFABBCEECCGGGGGEEFFHHFCHFFHFHFHFGEGGGFAGHHFHGHHFGHHHFFGGGEEGHGGHFGHHHHHHHFHHEEFFGGHFFHEEGGDGHHHHHHBCFHDHHFGGHGHHHGFFFFHHHHHGHGGGCC?FHGHHGCGHGHGHHHHHHHHGHHHHECCGGGGGGGGGGGGGGGGGGGGG@GBFFFFFFFF9FDFBEDFFFFFFFBFFFFFFFFFFFFFFFF?BBFEFFB;DF<BBBBFBB9@
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF
and a list of IDs as follows:
$ cat IDs.txt
MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
Designing the PCRE for FASTQ records is much simpler: we know that each FASTQ record spans 4 lines, so there is no need to activate PCRE_DOTALL using (?s) or to use a positive look-ahead assertion. Now run the awk portion of the one-liner for FASTQ:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1'
^@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA.*?(\n.*){3}
^@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA.*?(\n.*){3}
Here,
$0="^@"$0".*?(\\n.*){3}" encapsulates the IDs within regular expressions
^@ searches for records starting with @
(\\n.*){3} extracts the next three lines, starting with the newline character that ends the header
Using the IDs in IDs.txt, we can grep the records as follows:
$ cat IDs.txt | awk '{gsub("_","\\_",$0);$0="^@"$0".*?(\\n.*){3}"}1' | pcregrep -oM -f - f1.fastq
@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF
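The FASTQ pattern relies on every record occupying exactly four lines. Some FASTQ files wrap the sequence and quality strings over several lines, in which case (\n.*){3} would cut records short. The following sketch, with the hypothetical name check_fastq_4line, tests that assumption before extraction. It only checks line positions (header lines start with @, separator lines start with +, and the line count is a multiple of four), so it is a sanity check rather than a full validator.

```shell
# Return success (exit status 0) if the file looks like a
# four-line-per-record FASTQ, and failure (1) otherwise.
check_fastq_4line() {
  awk '
    NR % 4 == 1 && !/^@/   { bad = 1 }   # 1st line of each record: header
    NR % 4 == 3 && !/^[+]/ { bad = 1 }   # 3rd line of each record: separator
    END { exit (bad || NR % 4 != 0) }
  ' "$1"
}

# Demonstration on a throwaway two-record file; the second quality string
# starts with "@", which is legal and does not confuse the positional check.
tmp=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\n@III\n' > "$tmp"
if check_fastq_4line "$tmp"; then echo "4-line FASTQ"; else echo "not 4-line FASTQ"; fi
rm -f "$tmp"
```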
We can similarly make an alias as before:
$ alias FASTQgrep="awk '{gsub(\"_\",\"\\\_\",\$0);\$0=\"^@\"\$0\".*?(\\\n.*){3}\"}1' | pcregrep -oM -f -"
$ cat IDs.txt | FASTQgrep f1.fastq
@MSQ-M01442:38:000000000-A5H4M:1:2110:20073:12068 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGCGCAAGCCTGATCTAGCCATGCCGCGTGTGTGATGAAGGTCTTAGGATCGTAAAGCACTTTCGCCAGGGATGATAATGACAGTACCTGGTAAAGAAACCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGTTAGCGTTGTTCTGAATTACTGGGCTTAAAGCGTACGTAGGCGGATAGGAAAGTTGGGGGTGAAATCCCAG
+
DDDDCFFFFCCDGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHGHHHHHHHHHHHHHHHGGGGGGHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHHGGGGGHGGHGHHHHHHHHHHHHHHHHHHHHHHHHGHHGGGGGGGHHHHHGHGGHHHGHGHHGGGGFGGHHHHGGGGGGGGGFFFFFFFFFF0;:;FFFFFFFFF/:;FFFFFFFFFFFFFFFFF:.;;BBFFFFFFFFFFFFFFFFFFFF
@MSQ-M01442:38:000000000-A5H4M:1:2114:25736:16989 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATTTTGGACAATGGGGGCAACCCTGATCCAGCCATCCCGCGTGTGCGATGAAGGCCTTCGGGTTGTAAAGCACTTTTGGCAGGAAAGAAACGGCACGGGCTAATATCCTGTGCAACTGTCGGTACCTGCAGAATAAGCACCGGCTAACTACGTGCCAGCAGCTGCGGTAATACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCTTGCGCAGGCGGTT
+
BBCBBFFFFCCCGGGGGGGGGGGGHHHHHGHHHHHHHHGGGGGHHGHHHHHHHHHGHHHHHGGGGGGHHGGHGHHHHHHHHHHGGHGGGHHHHHHHHHHHHGHHGGFHHHHHHHGGGGGGGGGG0>>GHHHHHHHHHHHHH==EGGHGHHHHHHHHHHHHHHHH:;ADGBFGGGGGGGGGGG/:;/;9AB;AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.ACDCFEFF00;ADFFFFFFFFF
If we want to match those MiSeq IDs that come from tile 1102 or 1108, we can use the following:
$ echo "MSQ-M01442:38:000000000-A5H4M:1:110(2|8)" | FASTQgrep f1.fastq
@MSQ-M01442:38:000000000-A5H4M:1:1102:12736:13241 1:N:0:CGAGGCTGAAGGAGTA
CCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGGGAAACCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTCGGGTTGTAAAGCACTTTCAGCGGGGAGGAAGGCGACAAGGTTAATAACCTTGTCGATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGCAATACGGAGGGTGCAAGCGTTAATCGGAAT
+
AAAA1>>1AD@1AE?GFBF0C0FGHHHF21AFGHFGCF/EEGHFGEHHHHHFDA1AAGFB@EEGEEEHHHFE1GE>0/FCEEGFEFFGGHFDHFHHFFGDDFG?CC@ACCG0<..<AC@-.C0<<0DD0<<0<<;C.GGHGE::;.C.C9AA?G?GFBFGGBBA9-A>BB/9FBAAAB-FE-B-F-9@99-;-:FB?A--=@-9-/BFEFA9AEBB-ABAE
@MSQ-M01442:38:000000000-A5H4M:1:1108:11468:15061 1:N:0:CGAGGCTGAAGGAGTA
CCTATGGGAGGCAGCAGTGGGGAATCTTGCACAATGGGCGGAAGCCTGATGCAGCGACGCCGCGTGAGGGATGACGGCCTTCGGGTTGTAAACCTCTTTCAGCAGGGACGAAGCGTTTGTGACGGTACCTGCAGAAGAAGCGCCGGCCAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGCGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGAGCTCGTAGGCGGCTTGTCGCGTCGACTGTGAAAA
+
CCCCDFFFFCBCGGGGGGGGGGGGHHHHHHHHHHHHHHHGGGGGHHHHHHHHHHHGGGGGGGGGGGGGHGGHGHHGGGGGHHHGGHGGGHHHHHHHHHHHHHHHHHGGGGGGGGGGGGGHHHHGGGHGHHHHHHHHHHHHGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFDFFFFFFFFFFFFFHHFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF