Fastest way to search text in multiple files?

I need to find some text in around 120 text files, and I want to know the best and fastest way to search them. Should I read each file into a RichTextBox and use its methods to search the text, or should I read the files into a string variable and then search using regular expressions?

I think the major factor behind performance is finding a way to avoid looping again through lines that have already been tested for a match. Is there any way to find all the matches in a file in one go? Does anyone know a way to find matches in text files the way Visual Studio does? It searched 200 text files for a match in around 800-1000 milliseconds. I think it uses multiple threads to accomplish this.

From your description (120 files, 70K-80K words, 1-2 MB per file), the best approach seems to be to read the files once and build an index that can be searched. I have included an example below to illustrate how this can be done, but it may be of limited use to you if you need more complex matching than finding an exact term or a prefixed term.

If you need more complex text search matching (while still getting good performance), I would advise you to look into the excellent Lucene library, which was built specifically for this purpose.

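using System.Collections.Generic; // Dictionary, IEnumerable, IEqualityComparer
using System.IO;                  // FileInfo
using System.Linq;                // Concat, Select, ToArray

public struct WordLocation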
{
    public WordLocation(string fileName, int lineNumber, int wordIndex)
    {
        FileName = fileName;
        LineNumber = lineNumber;
        WordIndex = wordIndex;
    }
    public readonly string FileName; // file containing the word.
    public readonly int LineNumber;  // line within the file.
    public readonly int WordIndex;   // index within the line.
}

public struct WordOccurrences
{
    private WordOccurrences(int nOccurrences, WordLocation[] locations)
    {
        NumberOfOccurrences = nOccurrences;
        Locations = locations;
    }

    public static readonly WordOccurrences None = new WordOccurrences(0, new WordLocation[0]);

    public static WordOccurrences FirstOccurrence(string fileName, int lineNumber, int wordIndex)
    {
        return new WordOccurrences(1, new [] { new WordLocation(fileName, lineNumber, wordIndex) });
    }

    public WordOccurrences AddOccurrence(string fileName, int lineNumber, int wordIndex)
    {
        return new WordOccurrences(
            NumberOfOccurrences + 1, 
            Locations
                .Concat(
                    new [] { new WordLocation(fileName, lineNumber, wordIndex) })
                .ToArray());
    }

    public readonly int NumberOfOccurrences;
    public readonly WordLocation[] Locations;
}

public interface IWordIndexBuilder
{
    void AddWordOccurrence(string word, string fileName, int lineNumber, int wordIndex);
    IWordIndex Build();
}

public interface IWordIndex
{
    WordOccurrences Find(string word);
}

public static class BuilderExtensions
{
    public static IWordIndex BuildIndexFromFiles(this IWordIndexBuilder builder, IEnumerable<FileInfo> wordFiles)
    {
        var wordSeparators = new char[] {',', ' ', '\t', ';' /* etc */ };
        foreach (var file in wordFiles)
        {
            var lineNumber = 1;
            using (var reader = file.OpenText())
            {
                while (!reader.EndOfStream)
                {
                    var words = reader
                         .ReadLine() 
                         .Split(wordSeparators, StringSplitOptions.RemoveEmptyEntries)
                         .Select(f => f.Trim());

                    var wordIndex = 1;
                    foreach (var word in words)
                        builder.AddWordOccurrence(word, file.FullName, lineNumber, wordIndex++);

                    lineNumber++;
                }
            }
        }
        return builder.Build();
    }
}

Then the simplest index implementation (which can only do an exact-match lookup) uses a dictionary internally:

public class DictionaryIndexBuilder : IWordIndexBuilder
{
    private Dictionary<string, WordOccurrences> _dict;

    private class DictionaryIndex : IWordIndex 
    {
        private readonly Dictionary<string, WordOccurrences> _dict;

        public DictionaryIndex(Dictionary<string, WordOccurrences> dict)
        {
            _dict = dict;
        }
        public WordOccurrences Find(string word)
        {
            WordOccurrences found;
            if (_dict.TryGetValue(word, out found))
                return found;
            return WordOccurrences.None;
        }
    }

    public DictionaryIndexBuilder(IEqualityComparer<string> comparer)
    {
        _dict = new Dictionary<string, WordOccurrences>(comparer);
    }
    public void AddWordOccurrence(string word, string fileName, int lineNumber, int wordIndex)
    {
        WordOccurrences current;
        if (!_dict.TryGetValue(word, out current))
            _dict[word] = WordOccurrences.FirstOccurrence(fileName, lineNumber, wordIndex);
        else
            _dict[word] = current.AddOccurrence(fileName, lineNumber, wordIndex);
    }
    public IWordIndex Build()
    {
        var dict = _dict;
        _dict = null;
        return new DictionaryIndex(dict);
    }
}

Usage:

var builder = new DictionaryIndexBuilder(EqualityComparer<string>.Default);
var index = builder.BuildIndexFromFiles(myListOfFiles);
var matchSocks = index.Find("Socks");

If you also want to do prefix lookups, implement an index builder/index class that uses a sorted dictionary (and change the IWordIndex.Find method to return multiple matches, or add a new method to the interface for finding partial/pattern matches).

If you want to do more complex lookups, go for something like Lucene.
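To make the prefix idea a bit more concrete, here is a minimal sketch. It is not part of the answer above: SortedPrefixLookup and FindByPrefix are hypothetical names, and it keeps a sorted list of the distinct words rather than a SortedDictionary, because a binary search over a sorted list is the simplest way I know to jump to the first word that could carry the prefix. It assumes ordinal (case-sensitive) comparison.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the original answer: keeps the distinct
// words in ordinal order so that all words sharing a prefix are contiguous.
public sealed class SortedPrefixLookup
{
    private readonly List<string> _sortedWords;

    public SortedPrefixLookup(IEnumerable<string> words)
    {
        _sortedWords = words.Distinct(StringComparer.Ordinal).ToList();
        _sortedWords.Sort(StringComparer.Ordinal);
    }

    // Returns every indexed word that starts with the given prefix.
    public IEnumerable<string> FindByPrefix(string prefix)
    {
        // BinarySearch returns the index of an exact match, or the bitwise
        // complement of the index of the first larger element.
        int start = _sortedWords.BinarySearch(prefix, StringComparer.Ordinal);
        if (start < 0)
            start = ~start;

        // Words with the prefix form one contiguous run starting at 'start'.
        for (int i = start; i < _sortedWords.Count; i++)
        {
            if (!_sortedWords[i].StartsWith(prefix, StringComparison.Ordinal))
                yield break;
            yield return _sortedWords[i];
        }
    }
}
```

Each word returned by FindByPrefix could then be passed to IWordIndex.Find to collect its occurrences.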

Here is what I would do if I were you:

1- I'd load all the file paths into a list of strings.

2- I'd create a new list to store the paths of the files that match my search term.

3- I'd loop over the file list with foreach, search each file for my term, and add every matching file to the new list.

string searchTerm = "Some terms";
string[] MyFilesList = Directory.GetFiles(@"c:\txtDirPath\", "*.txt");
List<string> FoundedSearch = new List<string>();
foreach (string filename in MyFilesList)
{
    string textFile = File.ReadAllText(filename);
    if (textFile.Contains(searchTerm))
    {
        FoundedSearch.Add(filename);
    }
}

Then you can do whatever you want with the FoundedSearch list.

By the way:

I don't know whether this is the best answer, but the performance should be very good up to about 800 text files with 1000 words per file; you can see the performance in this chart.
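Since the question mentions multiple threads, the same loop could also be spread across cores. This is only a sketch under my own assumptions: ParallelFileSearch and FindFilesContaining are made-up names, the match is an ordinal, case-sensitive substring test, and whether it is actually faster depends on the disk, so it would need measuring.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper: a parallel variant of the foreach loop above.
public static class ParallelFileSearch
{
    // Returns the paths of the files whose text contains searchTerm.
    public static List<string> FindFilesContaining(IEnumerable<string> filePaths, string searchTerm)
    {
        // ConcurrentBag is safe to add to from several threads at once.
        var matches = new ConcurrentBag<string>();

        Parallel.ForEach(filePaths, path =>
        {
            string text = File.ReadAllText(path);
            if (text.IndexOf(searchTerm, StringComparison.Ordinal) >= 0)
                matches.Add(path);
        });

        // The bag has no defined order, so sort for a stable result.
        return matches.OrderBy(p => p, StringComparer.Ordinal).ToList();
    }
}
```

It would be called as ParallelFileSearch.FindFilesContaining(MyFilesList, searchTerm).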

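I assume you need to search every file for the same string. You could use a compiled regex for each search.

string searchTerm = "searchWord";
// Verbatim string so that \b means "word boundary" rather than a backspace character;
// Regex.Escape keeps special characters in the search term literal.
Regex rx = new Regex(String.Format(@"\b{0}\b", Regex.Escape(searchTerm)), RegexOptions.Compiled);
List<string> filePaths = new List<string>(); // fill with the paths of the files to search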

foreach (string filePath in filePaths)
{
   string allText = File.ReadAllText(filePath);
   var matches = rx.Matches(allText);             
   //rest of code
}

You'd have to benchmark performance, but I imagine the major bottleneck will be opening and reading the files from disk. You could look into Memory-Mapped Files if that turns out to be the case. Alternatively, depending on what you're ultimately trying to do, a dedicated text search library such as Lucene.Net (as I4V mentioned in the comments) might be more suitable.
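One rough way to check where the time goes is to time the reads and the regex matching separately. The sketch below is only an illustration: SearchTiming and Measure are hypothetical names, and a single run like this is affected by the OS file cache and JIT warm-up, so treat the numbers as indicative at best.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper: separates time spent reading files from time spent matching.
public static class SearchTiming
{
    public static (TimeSpan ReadTime, TimeSpan SearchTime, int MatchCount) Measure(
        IEnumerable<string> filePaths, Regex rx)
    {
        var readWatch = new Stopwatch();
        var searchWatch = new Stopwatch();
        int matchCount = 0;

        foreach (string filePath in filePaths)
        {
            // Time only the disk read.
            readWatch.Start();
            string allText = File.ReadAllText(filePath);
            readWatch.Stop();

            // Time only the matching; reading Count forces every match to be found.
            searchWatch.Start();
            matchCount += rx.Matches(allText).Count;
            searchWatch.Stop();
        }

        return (readWatch.Elapsed, searchWatch.Elapsed, matchCount);
    }
}
```

If ReadTime dominates, faster I/O (or an index built once up front) is the place to look; if SearchTime dominates, the pattern itself is worth revisiting.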
