Search String Pattern in Large Text Files C#
I have been trying to search for string patterns in a large text file. I am reading it line by line and checking each line, which takes a lot of time. I did try with HashSet and ReadAllLines:
    HashSet<string> strings = new HashSet<string>(File.ReadAllLines(@"D:\Doc\Tst.txt"));
Now when I try to search for the string, it doesn't match, because the set looks for a match of the entire row. I just want to check whether the string appears in the row.
I had tried using this:
    using (System.IO.StreamReader file = new System.IO.StreamReader(@"D:\Doc\Tst.txt"))
    {
        while ((CurrentLine = file.ReadLine()) != null)
        {
            vals = chk_log(CurrentLine, date_Format, (range.Cells[i][counter]).Value2, vals);
            if (vals == true)
                break;
        }
    }
    bool chk_log(string LineText, string date_to_chk, string publisher, bool tvals)
    {
        if (LineText.Contains(date_to_chk))
            if (LineText.Contains(publisher))
            {
                tvals = true;
            }
            else
                tvals = false;
        else tvals = false;
        return tvals;
    }
But this takes too much time. Any help on this would be appreciated.
Reading into a HashSet doesn't make sense to me (unless there are a lot of duplicated lines), since you aren't testing for membership of the set.
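
To illustrate why the set lookup in the question finds nothing, here is a minimal sketch. The sample lines and the class and method names are made up for illustration; they are not from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HashSetVersusSubstring
{
    // Hypothetical sample data standing in for the lines of the file.
    public static readonly string[] SampleLines =
    {
        "2020-01-15 Contoso published report A",
        "2020-01-16 Fabrikam published report B",
    };

    // Set membership: true only if some line equals the text exactly.
    public static bool IsWholeLine(IEnumerable<string> lines, string text) =>
        new HashSet<string>(lines).Contains(text);

    // Substring search: true if any line contains the text somewhere.
    public static bool AppearsInAnyLine(IEnumerable<string> lines, string text) =>
        lines.Any(line => line.Contains(text));

    public static void Main()
    {
        Console.WriteLine(IsWholeLine(SampleLines, "Contoso"));      // False
        Console.WriteLine(AppearsInAnyLine(SampleLines, "Contoso")); // True
    }
}
```

The set only answers "is this exact line present?", so it helps when you look up whole lines, not fragments of them.
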
Taking a really naive approach, you could just do this (it needs using System.Linq):
    var isItThere = File.ReadAllLines(@"d:\docs\st.txt").Any(x =>
        x.Contains(date_to_chk) && x.Contains(publisher));
65K lines at (say) 1K a line isn't a lot of memory to worry about, and I personally wouldn't bother with Parallel, since it sounds like it would be superfast to do anyway.
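
If memory ever did become a concern, one option is File.ReadLines, which enumerates the file lazily instead of loading every line up front. The sketch below assumes a throwaway temporary file and made-up search values; ContainsBoth is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Linq;

public static class StreamingSearch
{
    // Returns true as soon as one line contains both values. File.ReadLines
    // enumerates lazily, so reading stops at the first matching line.
    public static bool ContainsBoth(string path, string dateToCheck, string publisher) =>
        File.ReadLines(path).Any(line =>
            line.Contains(dateToCheck) && line.Contains(publisher));

    public static void Main()
    {
        // Hypothetical temporary file standing in for the real one.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllLines(path, new[]
            {
                "2020-01-15 Contoso published report A",
                "2020-01-16 Fabrikam published report B",
            });
            Console.WriteLine(ContainsBoth(path, "2020-01-16", "Fabrikam")); // True
            Console.WriteLine(ContainsBoth(path, "2020-01-15", "Fabrikam")); // False
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```
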
You could replace Any with First to find the first result, or with Where to get an IEnumerable<string> containing all results.
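
As a rough sketch of the three variants side by side, using made-up in-memory lines rather than the real file: note that First throws an InvalidOperationException when nothing matches, so this example uses FirstOrDefault, which returns null instead. That substitution is my choice for the illustration, not part of the original suggestion.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryVariants
{
    // Hypothetical sample lines and search values.
    public static readonly List<string> Lines = new List<string>
    {
        "2020-01-15 Contoso report A",
        "2020-01-16 Fabrikam report B",
        "2020-01-16 Fabrikam report C",
    };

    // The same test as in the one-liner: both values must appear in the line.
    public static bool Matches(string x) =>
        x.Contains("2020-01-16") && x.Contains("Fabrikam");

    public static void Main()
    {
        bool any = Lines.Any(Matches);                  // is there a match at all?
        string first = Lines.FirstOrDefault(Matches);   // first match, or null if none
        IEnumerable<string> all = Lines.Where(Matches); // every match, evaluated lazily

        Console.WriteLine(any);         // True
        Console.WriteLine(first);       // 2020-01-16 Fabrikam report B
        Console.WriteLine(all.Count()); // 2
    }
}
```
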
You can use a compiled regular expression instead of String.Contains (compile once before looping over the lines). This typically gives better performance.
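    // Regex.Escape keeps characters in the search values from being read as regex syntax.
    // Note: '|' is alternation, so this matches a line containing either value.
    var regex = new Regex($"{Regex.Escape(date)}|{Regex.Escape(publisher)}", RegexOptions.Compiled);
    foreach (string line in File.ReadLines(@"D:\Doc\Tst.txt"))
    {
        if (regex.IsMatch(line)) break;
    }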
This also shows a convenient standard library function for reading a file line by line. Or, depending on what you want to do...
    var isItThere = File.ReadLines(@"D:\Doc\Tst.txt").Any(regex.IsMatch);
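
One thing to check before adopting the regex: an alternation with | matches a line that contains either value, whereas the code in the question requires both on the same line. The sketch below shows the difference on a made-up line. The class and method names and the sample values are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EitherVersusBoth
{
    // Alternation: true if the line contains the date OR the publisher.
    public static bool MatchesEither(string line, string date, string publisher)
    {
        var either = new Regex(
            $"{Regex.Escape(date)}|{Regex.Escape(publisher)}",
            RegexOptions.Compiled);
        return either.IsMatch(line);
    }

    // The question's requirement: the line contains the date AND the publisher.
    public static bool MatchesBoth(string line, string date, string publisher) =>
        line.Contains(date) && line.Contains(publisher);

    public static void Main()
    {
        // The publisher appears in this line, the date does not.
        string line = "2020-01-15 Fabrikam report A";

        Console.WriteLine(MatchesEither(line, "2020-01-16", "Fabrikam")); // True
        Console.WriteLine(MatchesBoth(line, "2020-01-16", "Fabrikam"));   // False
    }
}
```

In real use the Regex would be built once outside the loop, as the answer says; it is constructed inside the helper here only to keep the sketch short.
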