Searching for string patterns in a large text file in C#

Question

I have been trying to search for string patterns in a large text file. I read the file line by line and check each line, which takes a lot of time. I also tried loading it into a HashSet with ReadAllLines:

HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));

Now when I search for the string, it doesn't match, because the set looks for a match of the entire line. I just want to check whether the string appears anywhere in the line.

I also tried this:

using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    // Read one line at a time and stop at the first line that matches.
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals)
            break;
    }
}

// Returns true only when the line contains both the date and the publisher.
// tvals is never read; it is kept so the existing call site still compiles.
bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    return LineText.Contains(date_to_chk) && LineText.Contains(publisher);
}

But this takes too much time. Any help would be appreciated.

Answer 1

Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines), since you aren't testing for membership of the set.
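To make the distinction concrete, here is a minimal sketch of why the HashSet lookup in the question never matches. The class name, method names, and sample lines are all made up for illustration; only HashSet<string>.Contains, string.Contains, and Enumerable.Any are real library calls.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MembershipVsSubstring
{
    // Hypothetical sample lines standing in for the contents of the log file.
    public static readonly string[] SampleLines =
    {
        "2018-03-09 Acme published report A",
        "2018-03-10 Globex published report B",
    };

    // HashSet<string>.Contains is an equality test against whole elements,
    // so it only succeeds when 'text' is identical to an entire line.
    public static bool WholeLineMatch(IEnumerable<string> lines, string text) =>
        new HashSet<string>(lines).Contains(text);

    // string.Contains is a substring test, applied here to each line in turn.
    public static bool SubstringMatch(IEnumerable<string> lines, string text) =>
        lines.Any(line => line.Contains(text));
}
```

With these sample lines, searching for "Acme" fails the whole-line test but passes the substring test, which is the behavior described in the question.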

Taking a really naive approach, you could just do this:

var isItThere = File.ReadAllLines(@"d:\docs\st.txt").Any(x => 
    x.Contains(date_to_chk) && x.Contains(publisher));

65K lines at (say) 1K per line is about 65 MB, which isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel, since the scan should be very fast anyway.

You could replace Any with First to find the first result, or with Where to get an IEnumerable<string> containing all results.
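The three variants might look roughly like this. It is a sketch rather than the answerer's code: the LineQueries class and its method names are invented here, and FirstOrDefault is swapped in for First because First throws InvalidOperationException when no line matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineQueries
{
    // True when the line contains both search terms.
    private static bool Matches(string line, string date, string publisher) =>
        line.Contains(date) && line.Contains(publisher);

    // Any: is there at least one matching line?
    public static bool AnyMatch(IEnumerable<string> lines, string date, string publisher) =>
        lines.Any(l => Matches(l, date, publisher));

    // FirstOrDefault: the first matching line, or null when there is none.
    public static string FirstMatch(IEnumerable<string> lines, string date, string publisher) =>
        lines.FirstOrDefault(l => Matches(l, date, publisher));

    // Where: every matching line, materialized into a list.
    public static List<string> AllMatches(IEnumerable<string> lines, string date, string publisher) =>
        lines.Where(l => Matches(l, date, publisher)).ToList();
}
```

Any of these can be fed from File.ReadLines(path) instead of File.ReadAllLines(path). File.ReadLines enumerates the file lazily, so AnyMatch and FirstMatch stop reading at the first hit rather than loading the whole file first.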

Answer 2

You can use a compiled regular expression instead of String.Contains (compile it once, before looping over the lines). This can give better performance, though it is worth measuring against String.Contains on your own data.

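// Lookaheads require both terms on the line; a plain "date|publisher" alternation would match either one.
// Regex.Escape keeps characters such as '.' or '(' in the inputs from being read as regex syntax.
var regex = new Regex($"^(?=.*{Regex.Escape(date)})(?=.*{Regex.Escape(publisher)})", RegexOptions.Compiled);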

foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}

This also shows a convenient standard library function for reading a file line by line: File.ReadLines.

Or, depending on what you want to do:

var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);
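Whether a compiled regex actually beats String.Contains depends on the input, so a rough timing on representative data is a sensible check. The sketch below is one way to set that up. Everything in it is an assumption for illustration: the class name, the generated filler lines, and the lookahead pattern, which is chosen so that the regex requires both terms just like the Contains check does.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContainsVsRegex
{
    // Builds hypothetical test data: every 1000th line contains both terms.
    public static string[] MakeLines(int count, string date, string publisher) =>
        Enumerable.Range(0, count)
            .Select(i => i % 1000 == 999
                ? $"{date} entry {i} from {publisher}"
                : $"filler entry {i} with no terms")
            .ToArray();

    // Counts lines containing both terms using two substring checks.
    public static int CountWithContains(string[] lines, string date, string publisher) =>
        lines.Count(l => l.Contains(date) && l.Contains(publisher));

    // Counts the same lines using one compiled regex with two lookaheads.
    public static int CountWithRegex(string[] lines, string date, string publisher)
    {
        var regex = new Regex(
            $"^(?=.*{Regex.Escape(date)})(?=.*{Regex.Escape(publisher)})",
            RegexOptions.Compiled);
        return lines.Count(l => regex.IsMatch(l));
    }

    // Rough wall-clock timing; not a substitute for a proper benchmark harness.
    public static TimeSpan Time(Func<int> action)
    {
        var stopwatch = Stopwatch.StartNew();
        action();
        return stopwatch.Elapsed;
    }
}
```

A call such as ContainsVsRegex.Time(() => ContainsVsRegex.CountWithContains(lines, date, publisher)), repeated for CountWithRegex, gives two durations to compare. Running each a few times first helps, since the first call includes JIT and regex compilation cost.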


Answer 1
2 2018-03-09 12:47:08

Answer 2
1 2018-03-09 13:00:12
