
How to search for a word in multiple documents in Java?

My actual requirement is to list all the files in a given directory that contain the search phrase textToMatch, in as little time as possible (about 4-5 seconds), where the number of files could be up to 100000 or more.

I don't want code; I just want the best algorithm for this.

Since you will have to open every file anyway, you can also use a tool built for this specific task. Use grep:

We have 100000 files to look at.

% ls -l *.txt | wc -l          
100000

They contain the word Vestibulum.

% grep Vestibulum 1.txt        
Aenean commodo ultrices imperdiet. Vestibulum ut justo vel sapien venenatis tincidunt.
euismod ultrices facilisis. Vestibulum porta sapien adipiscing augue congue id pretium lectus

Count the files containing Vestibulum, and time it.

% time grep -l Vestibulum *.txt | wc -l
100000
grep --color=auto -l Vestibulum *.txt  0,28s user 0,25s system 99% cpu 0,537 total
wc -l  0,00s user 0,01s system 1% cpu 0,537 total

As you can see, this takes only about half a second on my machine.

Your program must deal with 2 issues:

  1. Locating each and every file in each and every subdirectory, and
  2. Searching for the phrase you need inside every file.

For 1: You can walk the given directory yourself, either iteratively or recursively, or let a library do the work for you: a FileVisitor (available since Java 7) or Apache Commons IO.
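As a rough illustration of the FileVisitor route, here is a minimal sketch that collects every regular file under a root directory using Files.walkFileTree. The class name FileLister and the method listFiles are made up for this example, and silently skipping unreadable entries is an assumption about what you want, not a requirement.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
public class FileLister {

    /** Collects every regular file under root, descending into subdirectories. */
    public static List<Path> listFiles(Path root) throws IOException {
        final List<Path> files = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                // Keep regular files only; other entries are ignored.
                if (attrs.isRegularFile()) {
                    files.add(file);
                }
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFileFailed(Path file, IOException exc) {
                // Assumption: skip unreadable entries instead of aborting the walk.
                return FileVisitResult.CONTINUE;
            }
        });
        return files;
    }
}
```

Calling FileLister.listFiles(Paths.get("some/dir")) would then give you the list of candidate files to search.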

For 2: You could use a Java Scanner, or implement a very fast string-search algorithm yourself: the Boyer-Moore algorithm.
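To give an idea of what that involves, here is a sketch of Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that keeps only a bad-character shift table (it is not the full algorithm with the good-suffix rule). The class name HorspoolSearch and both method names are hypothetical. It also assumes that each file fits in memory and that the phrase should be matched as UTF-8 bytes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper class, named for this example only.
public class HorspoolSearch {

    /** Returns the index of the first occurrence of pattern in text, or -1 if absent. */
    public static int indexOf(byte[] text, byte[] pattern) {
        int m = pattern.length;
        int n = text.length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // shift[b] = how far the window may slide when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++) {
            shift[pattern[i] & 0xFF] = m - 1 - i;
        }

        int pos = 0;
        while (pos <= n - m) {
            // Compare the window against the pattern from right to left.
            int j = m - 1;
            while (j >= 0 && text[pos + j] == pattern[j]) {
                j--;
            }
            if (j < 0) return pos;
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    /** Assumption: the whole file is read into memory and matched as UTF-8 bytes. */
    public static boolean fileContains(Path file, String phrase) throws IOException {
        byte[] content = Files.readAllBytes(file);
        return indexOf(content, phrase.getBytes(StandardCharsets.UTF_8)) >= 0;
    }
}
```

The gain over a naive scan comes from the shift table: on a mismatch the window can often jump by the full pattern length, so longer search phrases tend to be found faster.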
