How to search for a word in multiple documents in Java?
My actual requirement is to list all the files in a given directory that contain the search phrase textToMatch in a minimal amount of time, about 4-5 seconds, where the number of files could be up to 100000 or more.
I don't want code; I just want the best algorithm for this.
Since you will have to open every file anyway, you can also use a tool built for this specific task. Use grep:
We have 100000 files to look at.
% ls -l *.txt | wc -l
100000
They contain Vestibulum.
% grep Vestibulum 1.txt
Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt.
euismod ultrices facilisis. Vestibulum porta sapien adipiscing augue congue id pretium lectus
Count the files containing Vestibulum, and time this.
% time grep -l Vestibulum *.txt | wc -l
100000
grep --color=auto -l Vestibulum *.txt 0,28s user 0,25s system 99% cpu 0,537 total
wc -l 0,00s user 0,01s system 1% cpu 0,537 total
As you can see, this takes only about half a second on my machine.
Your program must deal with 2 issues:
1. Finding all the files in the given directory.
2. Searching for the phrase inside each file.
For 1: You can search the given directory for files either iteratively or recursively, or let Java 7 or 8 do the work for you by using a FileVisitor. The Apache Commons IO library is another option.
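As a rough illustration of the FileVisitor route, the sketch below collects every regular file under a root directory with `Files.walkFileTree`. The class name `FileCollector` and the method `listFiles` are hypothetical names chosen for this example, and skipping unreadable entries is an assumption about the desired behaviour, not something the question specifies.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: gathers candidate files before they are searched.
class FileCollector {

    /** Recursively collects every regular file below the given root. */
    static List<Path> listFiles(Path root) throws IOException {
        List<Path> files = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                if (attrs.isRegularFile()) {
                    files.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Assumption: an unreadable entry is skipped rather than aborting the walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }
}
```

With 100000 files it may be worth searching each file inside `visitFile` directly instead of building the list first, so that the walk and the search overlap.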
For 2: You could use a Java Scanner, or implement yourself a very fast algorithm for searching inside files, called the Boyer-Moore algorithm.
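To give an idea of what such a search looks like, here is a minimal sketch of the Boyer-Moore-Horspool variant, a simplification of Boyer-Moore that uses only the bad-character shift table. It works on raw bytes and assumes each file fits in memory and uses the same encoding as the search phrase. The class name `HorspoolSearch` and its methods are hypothetical names for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: Boyer-Moore-Horspool search over byte arrays.
class HorspoolSearch {

    /** Builds the bad-character shift table; compute it once per pattern. */
    static int[] shiftTable(byte[] pattern) {
        int[] shift = new int[256];
        // Bytes that do not occur in the pattern allow a jump of the full pattern length.
        Arrays.fill(shift, pattern.length);
        for (int i = 0; i < pattern.length - 1; i++) {
            shift[pattern[i] & 0xFF] = pattern.length - 1 - i;
        }
        return shift;
    }

    /** Returns the index of the first occurrence of pattern in text, or -1. */
    static int indexOf(byte[] text, byte[] pattern, int[] shift) {
        int m = pattern.length;
        if (m == 0) {
            return 0;
        }
        int pos = 0;
        while (pos <= text.length - m) {
            // Compare from the end of the pattern backwards.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) {
                return pos;
            }
            // Shift by the table entry for the text byte under the pattern's last position.
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    /** Assumption: the whole file is read into memory before searching. */
    static boolean fileContains(Path file, byte[] pattern, int[] shift) throws IOException {
        return indexOf(Files.readAllBytes(file), pattern, shift) >= 0;
    }
}
```

The shift table depends only on the phrase, so it can be built once and reused for every file. Very large files would need a chunked or memory-mapped read instead of `readAllBytes`.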