
How to use grep with a large number (millions) of files to search for a string and get results in a few minutes

This question is related to How to use grep efficiently?

I am trying to search for a "string" in a folder that contains 8-10 million small (~2-3 kB) plain-text files. I need to know all the files that contain "string".

At first I used this

grep "string"

That was super slow.

Then I tried

grep * "string" {} \; -print

Based on the linked question, I used this

 find . | xargs -0 -n1 -P8 grep -H "string"

I get this error:

xargs: argument line too long

Does anyone know a way to accomplish this task relatively quicker?

I run this search on a server machine that has more than 50GB of available RAM and a 14-core CPU. I wish I could somehow use all that processing power to run this search faster.

You should remove the -0 argument to xargs and bump up the -n argument:

... | xargs -n16 ...
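A complete version of that fix might look like the sketch below. Alternatively, keeping -0 also works if find is told to emit NUL-delimited names with -print0, which protects filenames containing spaces or newlines; the path, pattern, and batch size here are placeholders, not values from the original post:

```shell
# Batch 1000 files per grep invocation and run 8 greps in parallel;
# -print0 / -0 keep unusual filenames from being split incorrectly.
find . -type f -print0 | xargs -0 -n1000 -P8 grep -H "string"
```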

That's not such a big stack of files (though kudos for 10⁷ files, a hoarder's dream), but I created 100k files (400 MB overall) with

for i in {1..100000}; do head -c 10 /dev/urandom > dummy_$i; done

and ran some tests out of pure curiosity (the keyword "10" I searched for was chosen at random):

> time find . | xargs -n1 -P8 grep -H "10"
real 0m22.626s
user 0m0.572s
sys  0m5.800s

> time find . | xargs -n8 -P8 grep -H "10"
real 0m3.195s
user 0m0.180s
sys  0m0.748s

> time grep "10" *
real 0m0.879s
user 0m0.512s
sys  0m0.328s

> time awk '/10/' *
real 0m1.123s
user 0m0.760s
sys  0m0.348s

> time sed -n '/10/p' *
real 0m1.531s
user 0m0.896s
sys  0m0.616s

> time perl -ne 'print if /10/' *
real 0m1.428s
user 0m1.004s
sys  0m0.408s

Btw., there isn't a big difference in running time if I suppress the output by piping stdout to /dev/null. I am using Ubuntu 12.04 on a not-so-powerful laptop ;) My CPU is an Intel(R) Core(TM) i3-3110M @ 2.40GHz.

More curiosity:

> time find . | xargs -n1 -P8 grep -H "10" 1>/dev/null

real 0m22.590s
user 0m0.616s
sys  0m5.876s

> time find . | xargs -n4 -P8 grep -H "10" 1>/dev/null

real 0m5.604s
user 0m0.196s
sys  0m1.488s

> time find . | xargs -n8 -P8 grep -H "10" 1>/dev/null

real 0m2.939s
user 0m0.140s
sys  0m0.784s

> time find . | xargs -n16 -P8 grep -H "10" 1>/dev/null

real 0m1.574s
user 0m0.108s
sys  0m0.428s

> time find . | xargs -n32 -P8 grep -H "10" 1>/dev/null

real 0m0.907s
user 0m0.084s
sys  0m0.264s

> time find . | xargs -n1024 -P8 grep -H "10" 1>/dev/null

real 0m0.245s
user 0m0.136s
sys  0m0.404s

> time find . | xargs -n100000 -P8 grep -H "10" 1>/dev/null

real 0m0.224s
user 0m0.100s
sys  0m0.520s
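The pattern in these timings is that amortizing grep's startup cost over more files per invocation helps most. Taking that to its conclusion, GNU grep can skip find/xargs entirely and walk the tree itself (a sketch; -l lists only the names of matching files, which is what the question asks for):

```shell
# A single grep process recurses through the directory itself,
# avoiding both per-batch startup overhead and argument-list limits.
grep -rl "string" .
```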

8 million files is a lot in one directory! However, 8 million times 2 kB is 16 GB, and you have 50 GB of RAM. I am thinking of a RAMdisk...
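One way to try that idea: on Linux, /dev/shm is usually an already-mounted tmpfs, so the corpus can be copied into RAM-backed storage without root; the source path below is a placeholder:

```shell
# Copy the files into RAM-backed tmpfs, so subsequent greps
# read from memory instead of hitting the disk.
mkdir -p /dev/shm/corpus
cp -r /path/to/files/. /dev/shm/corpus/
grep -rl "string" /dev/shm/corpus
```

Note that tmpfs contents vanish on reboot, so this only pays off if the same corpus is searched repeatedly.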

If you've got that much RAM, why not read it all into memory and use a regular expression library to search? It's a simple C program:

    #include <fcntl.h>
    #include <regex.h>
    ...
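A sketch of the core of such a program, assuming POSIX regcomp/regexec and files small enough (~2-3 kB each) to slurp whole; the driver main that walks the file list is omitted, and file_matches is a name invented here, not from the original post:

```c
#include <regex.h>
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if the file at `path` matches the compiled regex `re`,
   0 otherwise. The whole file is read into memory first, which is
   cheap here because each file is only a few kilobytes. Assumes
   plain text (no embedded NUL bytes). */
static int file_matches(const char *path, const regex_t *re) {
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;
    fseek(f, 0, SEEK_END);
    long n = ftell(f);
    rewind(f);
    char *buf = malloc((size_t)n + 1);
    int hit = 0;
    if (buf && fread(buf, 1, (size_t)n, f) == (size_t)n) {
        buf[n] = '\0';  /* regexec expects a NUL-terminated string */
        hit = (regexec(re, buf, 0, NULL, 0) == 0);
    }
    free(buf);
    fclose(f);
    return hit;
}
```

A main would regcomp the pattern once, loop over the filename arguments (or a list read from stdin), and print each path for which file_matches returns 1.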
