简体   繁体   中英

Search String Pattern in Large Text Files C#

I have been trying to search string patterns in a large text file. I am reading line by line and checking each line which is causing a lot of time. I did try with HashSet and ReadAllLines . HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\\Doc\\Tst.txt"));

Now when I am trying to search the string, it's not matching. As it is looking for a match of the entire row. I just want to check if the string appears in the row.

I had tried by using this:

using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
                {

                    while ((CurrentLine = file.ReadLine()) != null)
                    {
                        vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
                        if (vals == true)
                            break;
                    }
                }



bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
        {
            if (LineText.Contains(date_to_chk))
                if (LineText.Contains(publisher))
                {
                    tvals = true;
                }
                else
                    tvals = false;
            else tvals = false;
            return tvals;

        }

But this is consuming too much time. Any help on this would be good.

Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines) since you aren't testing for membership of the set.

Taking a really naive approach you could just do this.

var isItThere = File.ReadAllLines(@"d:\docs\st.txt").Any(x => 
    x.Contains(date_to_chk) && x.Contains(publisher));

65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel since it sounds like it would be superfast to do anyway.

You could replace Any where First to find the first result or Where to get an IEnumerable<string> containing all results.

You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.

var regex = new Regex($"{date}|{publisher}", RegexOptions.Compiled);

foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
{
    if (regex.IsMatch(line)) break;
}

This also shows a convenient standard library function for reading a file line by line.

Or, depending on what you want to do...

var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM