Most efficient way of removing lines that contain more than one string from a file?

I want to find the most efficient way to read a file (the hosts file) and remove every line that contains string 1 or string 2.

Currently I have the code below, which is obviously sluggish. What better methods are there?

using(StreamReader sr = File.OpenText(path)){
    while ((stringToRemove = sr.ReadLine()) != null)
    {
        if (!stringToRemove.Contains("string1"))
        {
            if (!stringToRemove.Contains("string2"))
            {
                emptyreplace += stringToRemove + Environment.NewLine;
            }
        }
    }
    sr.Close();
    File.WriteAllText(path, emptyreplace);
    hostFileConfigured = false;
    UInt32 result = DnsFlushResolverCache();
    MessageBox.Show(removeSuccess, windowOffline);
}

The primary problem is that you keep appending data to the end of a large regular string. Because strings are immutable, each append creates a whole new string, which costs a lot of time and, in particular, memory. Using string.Join avoids creating that (very large) number of intermediate string values.

You can also shorten the code that gets the lines of text by using File.ReadLines instead of using the stream directly. It's not really any better or worse, just prettier.

var lines = File.ReadLines(path)
    .Where(line => !line.Contains("string1") && !line.Contains("string2"));

File.WriteAllText(path, string.Join(Environment.NewLine, lines));

Another option would be to stream the writing of the output as well. A small helper that writes an IEnumerable<string> line by line, without first materializing the whole sequence, is simple enough to write ourselves:

public static void WriteLines(string path, IEnumerable<string> lines)
{
    using (var stream = File.CreateText(path))
    {
        foreach (var line in lines)
            stream.WriteLine(line);
    }
}

Also note that if we're streaming our output then we'll need a temporary file, since we don't want to be reading from and writing to the same file at the same time.

//same code as before
var lines = File.ReadLines(path)
    .Where(line => !line.Contains("string1") && !line.Contains("string2"));

//get a temp file path that won't conflict with any other files
string tempPath = Path.GetTempFileName();
//use the method from above to write the lines to the temp file
WriteLines(tempPath, lines);
//File.Move throws if the destination already exists,
//so delete the old file first, then move the temp file into its place
File.Delete(path);
File.Move(tempPath, path);

The primary advantage of this option, as opposed to the first, is that it will consume far less memory. In fact, it only ever needs to hold one line of the file in memory at a time, rather than the whole file. It does take up a bit of extra space on disk (temporarily) though.
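For the final swap in the temp-file approach, one alternative is File.Replace, which can also keep a backup copy of the original. The following is only a sketch under stated assumptions: the class name FileSwap, the method name Swap and the backupPath parameter are hypothetical, and it assumes all three paths are on the same volume, since File.Replace may fail otherwise.

```csharp
using System.IO;

public static class FileSwap
{
    // Hypothetical helper: puts newPath in place of targetPath.
    // If targetPath already exists, its old contents are kept at backupPath.
    public static void Swap(string newPath, string targetPath, string backupPath)
    {
        if (File.Exists(targetPath))
        {
            // File.Replace requires the destination to exist.
            File.Replace(newPath, targetPath, backupPath);
        }
        else
        {
            // Nothing to replace or back up, so a plain move is enough.
            File.Move(newPath, targetPath);
        }
    }
}
```

With the snippets above, this would be called as FileSwap.Swap(tempPath, path, path + ".bak"), where the ".bak" suffix is just an illustrative choice. Keeping a backup seems a reasonable precaution for a system file such as hosts.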

The first thing that stands out to me is the inefficient use of a string variable ( emptyreplace ) inside the while loop. Use a StringBuilder instead and it will be much more memory efficient.

For example:

StringBuilder emptyreplace = new StringBuilder();

using(StreamReader sr = File.OpenText(path)){
    while ((stringToRemove = sr.ReadLine()) != null)
    {
        if (!stringToRemove.Contains("string1"))
        {
            if (!stringToRemove.Contains("string2"))
            {
                //USE StringBuilder.AppendLine, and NOT string concatenation;
                //AppendLine already adds the line terminator
                emptyreplace.AppendLine(stringToRemove);
            }
        }
    }
   ...
}

The rest seems good enough.

Two suggestions:

  1. Create an array of strings to detect (I'll call them stopWords ) and use Linq's Any extension method.

  2. Rather than building the file up and writing it all at once, write each line to an output file one at a time while you're reading the source file, and replace the source file once you're done.

The resulting code:

string[] stopWords = new string[]
{
    "string1",
    "string2"
};

using(StreamReader sr = File.OpenText(srcPath))
using(StreamWriter sw = new StreamWriter(outPath))
{
    while ((stringToRemove = sr.ReadLine()) != null)
    {
        if (!stopWords.Any(s => stringToRemove.Contains(s)))
        {
            sw.WriteLine(stringToRemove);
        }
    }
}

//File.Move throws if the destination exists, so remove the original first
File.Delete(srcPath);
File.Move(outPath, srcPath);

There are a number of ways to improve this:

  • Compile the array of words you're searching for into a regex (e.g., word1|word2 ; beware of special characters) so that you only need to scan each line once. (This would also allow you to use \b to match whole words only.)

  • Write each line through a StreamWriter to a new file so that you don't need to store the whole thing in memory while building it. (After you finish, delete the original file and rename the new one.)
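The two points above can be sketched together as follows. This is a minimal illustration rather than a definitive implementation: the class name RegexLineFilter and its method names are hypothetical, Regex.Escape is used to handle the special characters mentioned above, and the word list is assumed to be non-empty (an empty pattern would match every line).

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexLineFilter
{
    // Builds one alternation pattern such as "string1|string2".
    // Regex.Escape makes characters like '.' match literally.
    // Assumes 'words' contains at least one entry.
    public static Regex BuildMatcher(IEnumerable<string> words)
    {
        string pattern = string.Join("|", words.Select(w => Regex.Escape(w)));
        return new Regex(pattern, RegexOptions.Compiled);
    }

    // Streams srcPath to outPath, skipping every line the matcher hits,
    // so only one line is held in memory at a time.
    public static void CopyWithoutMatches(string srcPath, string outPath, Regex matcher)
    {
        using (var reader = new StreamReader(srcPath))
        using (var writer = new StreamWriter(outPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                if (!matcher.IsMatch(line))
                    writer.WriteLine(line);
            }
        }
    }
}
```

To match whole words only, each escaped word could be wrapped as @"\b" + Regex.Escape(w) + @"\b". Whether the regex is actually faster than a couple of Contains calls for only two strings would need measuring.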

Is your hosts file really so big that you need to bother with reading it line by line? Why not simply do this?

var lines = File.ReadAllLines(path);
//a second variable is needed: 'lines' cannot be declared twice
var filtered = lines.Where(x => !badWords.Any(y => x.Contains(y))).ToArray();
File.WriteAllLines(path, filtered);

Update: I just realized that you are actually talking about the "hosts file". Assuming you mean %windir%\system32\drivers\etc\hosts , it is very unlikely that this file has a truly significant size (like more than a couple of KBs). So personally, I would go with the most readable approach. Like, for example, the one by @servy .

In the end you will have to read every line, and write every line that does not match your criteria. So you will always have a basic IO overhead that you cannot avoid. Depending on the actual (average) size of your files, that overhead might overshadow every other optimization technique you use in your code to actually filter the lines.

That said, you can be a little less wasteful on the memory side of things by not collecting all output lines in a buffer, but writing them directly to the output file as you read them (again, this might be pointless if your files are not very big).

using (var reader = new StreamReader(inputFile))
{
  using (var writer = new StreamWriter(outputFile))
  {
    string line;
    while ((line = reader.ReadLine()) != null)
    {
       if (line.IndexOf("string1") == -1 && line.IndexOf("string2") == -1)
       {
          writer.WriteLine(line);
       }
    }
  }
}

//File.Move throws if the destination exists, so remove the original first
File.Delete(inputFile);
File.Move(outputFile, inputFile);
