Search String Pattern in Large Text Files C#

I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which takes a lot of time. I did try with HashSet and ReadAllLines:

HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\\Doc\\Tst.txt"));

Now when I try to search for the string, it does not match, because the HashSet looks for a match of the entire line. I just want to check whether the string appears anywhere in the line.

I also tried this:

using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
{
    while ((CurrentLine = file.ReadLine()) != null)
    {
        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
        if (vals == true)
            break;
    }
}

bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
{
    if (LineText.Contains(date_to_chk))
        if (LineText.Contains(publisher))
        {
            tvals = true;
        }
        else
            tvals = false;
    else tvals = false;
    return tvals;
}

But this takes too much time. Any help on this would be appreciated.

Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines), since you aren't testing for membership of the set.
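To illustrate the point about membership, here is a minimal sketch. The sample lines and the search term "Acme" are made up for the example, and it assumes a recent .NET SDK that supports top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample lines standing in for the contents of the log file.
var lines = new[]
{
    "2020-01-01 Acme published report",
    "2020-01-02 Globex published report",
};

var set = new HashSet<string>(lines);

// HashSet<T>.Contains tests whole-element equality, so a fragment never matches.
bool exactHit = set.Contains("2020-01-01 Acme published report");
bool fragmentHit = set.Contains("Acme");

// A substring search has to inspect each line anyway, so the set gives no speed-up.
bool substringHit = lines.Any(l => l.Contains("Acme"));

Console.WriteLine($"{exactHit} {fragmentHit} {substringHit}"); // True False True
```

The set only helps when you look up complete lines; for "does this text appear somewhere in a line" every line still has to be scanned.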

Taking a really naive approach, you could just do this:

var isItThere = File.ReadAllLines(@"d:\docs\st.txt").Any(x => 
    x.Contains(date_to_chk) && x.Contains(publisher));

65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel, since it sounds like it would be superfast anyway.

You could replace Any with First to find the first result, or with Where to get an IEnumerable<string> containing all results.
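As a rough sketch of the three variants side by side: the temp file, its contents, and the search values below are invented for illustration. It uses File.ReadLines rather than ReadAllLines so that Any and FirstOrDefault can stop reading as soon as they find a match, and FirstOrDefault instead of First so that a miss returns null rather than throwing. It assumes a recent .NET SDK with top-level statements.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical sample data written to a temp file so the sketch is self-contained.
string path = Path.GetTempFileName();
File.WriteAllLines(path, new[]
{
    "2020-01-01 Acme alpha",
    "2020-01-02 Globex beta",
    "2020-01-02 Acme gamma",
});

string dateToCheck = "2020-01-02";   // assumed search values
string publisher = "Acme";

// Same test as in the question: the line must contain both values.
bool Matches(string line) => line.Contains(dateToCheck) && line.Contains(publisher);

// Any: stops reading at the first matching line.
bool isItThere = File.ReadLines(path).Any(Matches);

// FirstOrDefault: returns the first matching line, or null when nothing matches.
var firstMatch = File.ReadLines(path).FirstOrDefault(Matches);

// Where: lazily yields every matching line; ToList forces the full scan.
var allMatches = File.ReadLines(path).Where(Matches).ToList();

Console.WriteLine($"{isItThere} | {firstMatch} | {allMatches.Count}");
File.Delete(path);
```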

You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This may give better performance, though it is worth measuring against String.Contains on your own data.

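// Regex.Escape keeps characters such as '.' in the inputs from being read as regex syntax.
// Note: '|' matches a line that contains either value, not necessarily both.
var regex = new Regex($"{Regex.Escape(date)}|{Regex.Escape(publisher)}", RegexOptions.Compiled);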

foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}

This also shows File.ReadLines, a convenient standard library method for reading a file line by line.

Or, depending on what you want to do...

var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);
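One thing to watch: an alternation with | matches when either value is present, whereas chk_log in the question requires both. If you want to stay with a regex, one possible way to require both is a pair of lookaheads. This is only a sketch; the sample values and lines are assumptions, and it assumes a recent .NET SDK with top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

string date = "2020.01.02";      // assumed sample values; '.' is a regex metacharacter
string publisher = "Acme";

// Regex.Escape makes the inputs match literally.
// Two lookaheads anchored at the start require both values, in either order.
var both = new Regex(
    $"^(?=.*{Regex.Escape(date)})(?=.*{Regex.Escape(publisher)})",
    RegexOptions.Compiled);

// The alternation form matches when either value is present.
var either = new Regex(
    $"{Regex.Escape(date)}|{Regex.Escape(publisher)}",
    RegexOptions.Compiled);

string onlyPublisher = "2020.01.01 Acme alpha";
string bothPresent = "Acme 2020.01.02 beta";

Console.WriteLine(either.IsMatch(onlyPublisher)); // True
Console.WriteLine(both.IsMatch(onlyPublisher));   // False
Console.WriteLine(both.IsMatch(bothPresent));     // True
```

Whether this beats two plain Contains calls depends on the data, so it is worth timing both on a real file.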
