
How to search for a string in a large text file?

I want to get the line that contains a particular word that is never repeated, such as a profile ID, without writing a loop that reads each line separately. If the word I am looking for is on the last line of the text file, finding it will take a long time, and if I need to search for more than one word and extract each matching line, I think it will take even longer.

Example of a line in the text file: name,id,image,age,place,link

string word = "13215646";
string output = string.Empty;
    
using (var fileStream = File.OpenRead(FileName))
using (var streamReader = new StreamReader(fileStream, Encoding.UTF8))
{
    String line;
    while ((line = streamReader.ReadLine()) != null)
    {
        string[] strList = line.Split(',');
        if (word == strList[1]) // check if word = id
        {
            output = line;
            break;
        }
    }
}

You can use this to search the file:

var output = File.ReadLines(FileName).
    Where(line => line.Split(',')[1] == word).
    FirstOrDefault();

But it won't solve this:

if the word I am looking for is in the last line of the text file, this will take a lot of time to get it, and if the search process is for more than one word and extract the line that contains it, I think it will take a lot of time.

There's no practical way to avoid this for a basic file.

The only ways to avoid actually reading through the file are to maintain an index, which requires absolute control over everything that might write to the file, or to guarantee that the file is already sorted by the columns that matter, in which case you can do something like a binary search.

But neither is likely for a random CSV file. This is one of the reasons people use databases.
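When the file is under your control and several lookups are needed, the index idea can be approximated in memory. The following is only a minimal sketch: the class name `CsvIdIndex` and method `Build` are hypothetical, and it assumes .NET Core 2.0 or later (for `Split(char, int)` and `Dictionary.TryAdd`), ids in the second column, and a file small enough to hold in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the original answer.
public static class CsvIdIndex
{
    // Reads the file once and maps each id (second column) to its full line.
    // Assumes ids are unique; if an id repeats, the first line seen is kept.
    public static Dictionary<string, string> Build(string path)
    {
        var index = new Dictionary<string, string>(StringComparer.Ordinal);
        foreach (var line in File.ReadLines(path))
        {
            // Only the first two fields matter, so stop splitting after them.
            var fields = line.Split(',', 3);
            if (fields.Length > 1)
            {
                index.TryAdd(fields[1], line);
            }
        }
        return index;
    }
}
```

Building the index still costs one full read of the file, but each later lookup, for example `index.TryGetValue("13215646", out var line)`, no longer touches the file. The index goes stale as soon as anything else writes to the file, which is the control requirement mentioned above.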

However, we also need to stop and check whether this is really a problem for you. I'd expect the code above to handle files up to a couple hundred MB in around 1 to 2 seconds on modern hardware, even if you need to look through the whole file.
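One way to check whether the scan is actually slow on your data is to time it. This is a rough sketch rather than a proper benchmark; `SearchTimer` and `FindTimed` are made-up names, and a single run is affected by disk caching.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of the original answer.
public static class SearchTimer
{
    // Returns the first line whose second field equals id (or null if none)
    // and reports how long the scan took.
    public static string FindTimed(string path, string id, out TimeSpan elapsed)
    {
        var stopwatch = Stopwatch.StartNew();
        var match = File.ReadLines(path)
            .FirstOrDefault(line =>
            {
                var fields = line.Split(',');
                return fields.Length > 1 && fields[1] == id;
            });
        stopwatch.Stop();
        elapsed = stopwatch.Elapsed;
        return match;
    }
}
```

Searching for an id that sits on the last line, or one that does not exist at all, gives the worst-case time for a full pass over the file.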

You can optimise the code. Here are a few ideas:

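var ids = new[] { "13215646", "113" }; // Contains on an array needs `using System.Linq;`

foreach (var line in File.ReadLines(FileName))
{
    // Optimization 1: `count: 3` stops splitting once the id field is isolated
    var fields = line.Split(',', count: 3);

    // Optimization 2: search for multiple ids in a single pass
    // (the length check skips lines that have no second field)
    if (fields.Length > 1 && ids.Contains(fields[1]))
    {
        // Do what you need with the line
    }
}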
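A further idea in the same direction, which is not from the answer above, is to skip `Split` entirely and slice out the id with spans so that no array is allocated per line. This is a sketch under the assumption that fields never contain quoted commas; `CsvFields.SecondField` is a hypothetical name, and it needs a runtime with `Span` support (.NET Core 2.1 or later).

```csharp
using System;

// Hypothetical helper, not part of the original answer.
public static class CsvFields
{
    // Returns the second comma-separated field of a line without allocating
    // an array; returns an empty span when the line contains no comma.
    public static ReadOnlySpan<char> SecondField(string line)
    {
        ReadOnlySpan<char> span = line.AsSpan();
        int first = span.IndexOf(',');
        if (first < 0)
        {
            return ReadOnlySpan<char>.Empty;
        }

        // Everything after the first comma, cut at the next comma if present.
        ReadOnlySpan<char> rest = span.Slice(first + 1);
        int second = rest.IndexOf(',');
        return second < 0 ? rest : rest.Slice(0, second);
    }
}
```

A single id can then be compared with `CsvFields.SecondField(line).SequenceEqual(word.AsSpan())`. Whether this is noticeably faster than `Split` with a count depends on the data, so it is worth measuring before adopting it.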
