
How to use grep with a large number (millions) of files to search for a string and get results in a few minutes

This question is related to How to use grep efficiently?

I am trying to search for a "string" in a folder that contains 8-10 million small (~2-3 kB) plain-text files. I need to know all the files that contain "string".

At first I used this

grep "string"

That was super slow.

Then I tried

grep * "string" {} \; -print

Based on the linked question, I used this

 find . | xargs -0 -n1 -P8 grep -H "string"

I get this error:

xargs: argument line too long

Does anyone know a way to accomplish this task relatively quicker?

I run this search on a server machine that has more than 50GB of available RAM and a 14-core CPU. I wish I could somehow use all that processing power to run this search faster.

You should remove the -0 argument to xargs and bump up the -n argument:

... | xargs -n16 ...
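A complete version of that fix might look like the sketch below. Alternatively, keeping -0 also works if find is told to emit NUL-delimited names with -print0, which protects filenames containing spaces or newlines; the path, pattern, and batch size here are placeholders, not values from the original post:

```shell
# Batch 1000 files per grep invocation and run 8 greps in parallel;
# -print0 / -0 keep unusual filenames from being split incorrectly.
find . -type f -print0 | xargs -0 -n1000 -P8 grep -H "string"
```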

That's not such a big stack of files (though kudos for 10⁷ files, a hoarder's dream), but I created 100k files (400 MB overall) with

for i in {1..100000}; do head -c 10 /dev/urandom > dummy_$i; done

and ran some tests out of pure curiosity (the keyword "10" I searched for was chosen at random):

> time find . | xargs -n1 -P8 grep -H "10"
real 0m22.626s
user 0m0.572s
sys  0m5.800s

> time find . | xargs -n8 -P8 grep -H "10"
real 0m3.195s
user 0m0.180s
sys  0m0.748s

> time grep "10" *
real 0m0.879s
user 0m0.512s
sys  0m0.328s

> time awk '/10/' *
real 0m1.123s
user 0m0.760s
sys  0m0.348s

> time sed -n '/10/p' *
real 0m1.531s
user 0m0.896s
sys  0m0.616s

> time perl -ne 'print if /10/' *
real 0m1.428s
user 0m1.004s
sys  0m0.408s

Btw., there isn't a big difference in running time if I suppress the output by piping stdout to /dev/null. I am using Ubuntu 12.04 on a not-so-powerful laptop ;) My CPU is an Intel(R) Core(TM) i3-3110M @ 2.40GHz.

More curiosity:

> time find . | xargs -n1 -P8 grep -H "10" 1>/dev/null

real 0m22.590s
user 0m0.616s
sys  0m5.876s

> time find . | xargs -n4 -P8 grep -H "10" 1>/dev/null

real 0m5.604s
user 0m0.196s
sys  0m1.488s

> time find . | xargs -n8 -P8 grep -H "10" 1>/dev/null

real 0m2.939s
user 0m0.140s
sys  0m0.784s

> time find . | xargs -n16 -P8 grep -H "10" 1>/dev/null

real 0m1.574s
user 0m0.108s
sys  0m0.428s

> time find . | xargs -n32 -P8 grep -H "10" 1>/dev/null

real 0m0.907s
user 0m0.084s
sys  0m0.264s

> time find . | xargs -n1024 -P8 grep -H "10" 1>/dev/null

real 0m0.245s
user 0m0.136s
sys  0m0.404s

> time find . | xargs -n100000 -P8 grep -H "10" 1>/dev/null

real 0m0.224s
user 0m0.100s
sys  0m0.520s
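The pattern in these timings is that amortizing grep's startup cost over more files per invocation helps most. Taking that to its conclusion, GNU grep can skip find/xargs entirely and walk the tree itself (a sketch; -l lists only the names of matching files, which is what the question asks for):

```shell
# A single grep process recurses through the directory itself,
# avoiding both per-batch startup overhead and argument-list limits.
grep -rl "string" .
```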

8 million files is a lot in one directory! However, 8 million times 2 kB is 16 GB, and you have 50 GB of RAM. I am thinking of a RAMdisk...
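One way to try that idea: on Linux, /dev/shm is usually an already-mounted tmpfs, so the corpus can be copied into RAM-backed storage without root; the source path below is a placeholder:

```shell
# Copy the files into RAM-backed tmpfs, so subsequent greps
# read from memory instead of hitting the disk.
mkdir -p /dev/shm/corpus
cp -r /path/to/files/. /dev/shm/corpus/
grep -rl "string" /dev/shm/corpus
```

Note that tmpfs contents vanish on reboot, so this only pays off if the same corpus is searched repeatedly.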

If you've got that much RAM, why not read it all into memory and use a regular expression library to search? It's a simple C program:

    #include <fcntl.h>
    #include <regex.h>
    ...
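A sketch of the core of such a program, assuming POSIX regcomp/regexec and files small enough (~2-3 kB each) to slurp whole; the driver main that walks the file list is omitted, and file_matches is a name invented here, not from the original post:

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if the file at `path` matches the compiled regex `re`,
   0 otherwise. The whole file is read into memory first, which is
   cheap here because each file is only a few kilobytes. Assumes
   plain text (no embedded NUL bytes). */
static int file_matches(const char *path, const regex_t *re) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    rewind(f);
    char *buf = malloc((size_t)n + 1);
    int hit = 0;
    if (buf && fread(buf, 1, (size_t)n, f) == (size_t)n) {
        buf[n] = '\0';  /* regexec expects a NUL-terminated string */
        hit = (regexec(re, buf, 0, NULL, 0) == 0);
    }
    free(buf);
    fclose(f);
    return hit;
}
```

A main would regcomp the pattern once, loop over the filename arguments (or a list read from stdin), and print each path for which file_matches returns 1.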
