Speeding Up Grep Searches
Sometimes you may find yourself needing to filter a large amount of output using the grep command. However, grep can struggle when you filter a file with an incredibly large number of lines against a long list of patterns: the patterns are held in RAM while every line of the file is checked, which can quickly exhaust even large amounts of requested RAM. There are a few ways around this.
Using the C locale instead of UTF-8
Prefixing the grep command with LC_ALL=C (as per this helpful Stack Overflow post) uses the C locale instead of UTF-8 and improves job runtimes significantly (between 20 and 100 times faster). Recently a user was running a grep command to filter over 6000 lines of patterns from a 300GB file with over 4 billion lines, using a command similar to:
grep -f patterns.txt file.txt > final.txt
The -f option obtains patterns from FILE (in this case patterns.txt), one per line (for more information see the grep manpage, which can be viewed on Apocrita using man grep). So the command above looks for matching patterns defined in patterns.txt in the file file.txt and redirects the output to final.txt.
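As a small, purely illustrative sketch (the pattern values and file contents below are invented), if patterns.txt held a handful of identifiers, grep -f would print every line of file.txt containing any of them:
$ cat patterns.txt      # hypothetical contents: one pattern per line
gene_0001
gene_0042
$ grep -f patterns.txt file.txt    # prints every line of file.txt containing either pattern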
The command above had been running for 24 hours and had still not finished. Simply by using the LC_ALL=C prefix:
LC_ALL=C grep -f patterns.txt file.txt > final.txt
the job completed in around 15 minutes. If the content of your data is pure ASCII characters, then LC_ALL=C should be fine to use.
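If you want to check that the data really is plain ASCII, and get a feel for the speed difference before committing to a long job, something along these lines may help (the file names follow the example above, the sample size is arbitrary, and note that file only inspects the start of a large file):
# Report the detected character set of the data ("ASCII text" suggests LC_ALL=C is safe)
file file.txt
# Compare the two variants on a smaller sample rather than the full 300GB file
head -n 1000000 file.txt > sample.txt
time grep -f patterns.txt sample.txt > /dev/null
time LC_ALL=C grep -f patterns.txt sample.txt > /dev/null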
Interpreting patterns as fixed strings
Adding -F to grep interprets the patterns to be matched as a list of fixed strings, separated by newlines, any of which is to be matched, rather than as regular expressions. If you don't require the use of regular expressions (also known as regex) then this can be a real improvement as well.
Recently a user was running a grep command to filter over 470,000 lines of patterns from a 300GB file with over 53 million lines:
LC_ALL=C grep -f patterns.txt file.txt > final.txt
Despite using LC_ALL=C to use the C locale instead of UTF-8, the job would use an enormous amount of RAM, as each of the 470,000 lines of patterns in patterns.txt was loaded and treated as a regular expression. Even a large amount of RAM such as 256GB would quickly be entirely exhausted and the job would be killed by the scheduler on Apocrita before it had a chance to finish.
By adding -F to the grep command, like so:
LC_ALL=C grep -Ff patterns.txt file.txt > final.txt
the user's job completed in just a few minutes.
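As a tiny, hypothetical illustration of the difference (demo.txt and its contents are invented purely for this example): without -F, the dot in the pattern 1.5 is a regex metacharacter that matches any character, whereas with -F it matches only a literal dot.
printf '1.5\n125\n1x5\n' > demo.txt
grep '1.5' demo.txt     # regex match: prints all three lines
grep -F '1.5' demo.txt  # fixed-string match: prints only 1.5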
Splitting big files into smaller ones
If neither of the above tips works and your job is still running out of RAM, you may need to use split (available on the cluster without needing to load any additional modules) to split your files into smaller ones:
$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'. With no INPUT, or when INPUT
is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file
-d, --numeric-suffixes[=FROM] use numeric suffixes instead of alphabetic;
FROM changes the start value (default 0)
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
For example, to split a large patterns file with 400,000 lines into multiple parts of 10,000 lines each:
split --numeric-suffixes -l 10000 patterns.txt patterns.txt_
This would give you 40 files, named with numeric suffixes, like so:
patterns.txt_00
patterns.txt_01
patterns.txt_02
patterns.txt_03
patterns.txt_04
<etc>
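The array job shown below reads the name of each split pattern file from a list, one file name per line. A minimal sketch of creating that list, assuming the file names produced above (list_of_patterns.txt is just an example name):
# List the split pattern files, one per line, for the array job to read
ls patterns.txt_* > list_of_patterns.txt
# Sanity check: should report 40 lines, matching the array task range
wc -l list_of_patterns.txt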
You could then use an array job, perhaps using the list_of_files.txt method, to concurrently run your grep command against each smaller 10,000-line file, and then combine the output at the end, something like:
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=240:0:0
#$ -l h_vmem=8G
#$ -N grep_array
#$ -t 1-40
#$ -j y
# Pick this task's pattern file from the list of split files
INPUT_FILE=$(sed -n "${SGE_TASK_ID}p" list_of_patterns.txt)
# Search the large file with this task's patterns, writing a per-task output file
LC_ALL=C grep -Ff "${INPUT_FILE}" file.txt > "final.txt.${SGE_TASK_ID}-${INPUT_FILE}.txt"
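Once every array task has finished, the per-task output files can be combined into a single result. A minimal sketch, assuming the output file names used in the script above (this could be run by hand, or submitted as a separate job held on the array with -hold_jid grep_array):
# Concatenate the per-task results into one file
cat final.txt.* > final.txt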
On this occasion, the large 400,000-line pattern file was split, but you could also split file.txt if it is too large.
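For instance, here is a sketch of splitting the data file itself into 40 chunks without breaking any lines in half, using the -n l/N form shown in the help output above (the chunk count of 40 is arbitrary); each chunk could then be searched against the full patterns.txt in the same array-job fashion:
# Split file.txt into 40 line-aligned chunks named file.txt_00, file.txt_01, ...
split --numeric-suffixes -n l/40 file.txt file.txt_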
We hope you find these tips useful. As usual, you can ask a question on our Slack channel (QMUL users only), or send an email to its-research-support@qmul.ac.uk, which is handled directly by staff with relevant expertise.
Title image: Generated using DALL-E-2