
Regex performance degrades

I'm writing a C# application that runs a number of regular expressions (~10) on a lot (~25 million) of strings. I did try to google this, but any searches for regex with "slows down" are full of tutorials about how backreferencing etc. slows down regexes. I am assuming that this is not my problem, because my regexes start out fast and then slow down.

For the first million or so strings it takes about 60ms per 1000 strings to run the regular expressions. By the end, it's slowed down to the point where it's taking about 600ms. Does anyone know why?
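As a first step, it can help to confirm exactly how the per-chunk time grows. A minimal timing harness (hypothetical, not from the original post; the pattern and inputs are placeholders) might look like this:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class TimingSketch
{
    static void Main()
    {
        var regex = new Regex(@"\w+ said (\w*)", RegexOptions.Compiled);
        var sw = new Stopwatch();

        for (int chunk = 0; chunk < 5; chunk++)
        {
            sw.Restart();
            for (int i = 0; i < 1000; i++)
            {
                // Fresh input each iteration so nothing is trivially cached.
                regex.Match("mike said hello" + i);
            }
            sw.Stop();
            Console.WriteLine($"chunk {chunk}: {sw.ElapsedMilliseconds} ms per 1000 strings");
        }
    }
}
```

If the per-chunk time stays flat in isolation like this but grows in the real application, the slowdown is likely in the surrounding code (growing collections, per-row I/O, GC pressure) rather than in the regex engine itself.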

It was worse, but I improved it by using instances of Regex instead of the cached static version, and by compiling the expressions that I could.
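For reference, the difference between the two approaches looks roughly like this (the pattern here is just an illustration):

```csharp
using System;
using System.Text.RegularExpressions;

class CompiledExample
{
    static void Main()
    {
        // Static helper: the pattern is looked up in a small internal cache
        // (bounded by Regex.CacheSize) on every call, and parsed on a miss.
        Match m1 = Regex.Match("mike said hello", @"\w+ said (\w*)");

        // Instance created once, compiled to IL with RegexOptions.Compiled:
        // slower to construct, but faster per call on hot paths.
        var saidRegex = new Regex(@"\w+ said (\w*)", RegexOptions.Compiled);
        Match m2 = saidRegex.Match("mike said hello");

        Console.WriteLine(m2.Groups[1].Value); // prints "hello"
    }
}
```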

Some of my regexes need to vary, eg depending on the user's name it might be mike said (\w*) or john said (\w*)

My understanding is that it is not possible to compile those regexes and pass in parameters (eg saidRegex.Match(inputString, userName)).
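That's true as far as it goes (Regex has no parameter mechanism), but most of the benefit can be had by building one compiled instance per user name and caching it. A sketch, assuming the set of user names is bounded; `SaidRegexCache` and `GetSaidRegex` are made-up names:

```csharp
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

static class SaidRegexCache
{
    // One compiled Regex per user name, built on first use and reused after.
    static readonly ConcurrentDictionary<string, Regex> SaidRegexes =
        new ConcurrentDictionary<string, Regex>();

    public static Regex GetSaidRegex(string userName) =>
        SaidRegexes.GetOrAdd(userName, name =>
            // Regex.Escape in case a name contains regex metacharacters.
            new Regex(Regex.Escape(name) + @" said (\w*)", RegexOptions.Compiled));
}
```

That way the per-user pattern is parsed and compiled once, not once per input string.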

Does anyone have any suggestions?

[Edited to accurately reflect speed - was per 1000 strings, not per string]

This may not be a direct answer to your question about Regex performance degradation - which is somewhat fascinating. However, after reading all of the commentary and discussion above, I'd suggest the following:

Parse the data once, splitting the matched data out into a database table. It looks like you're trying to capture the following fields:

Player_Name | Monetary_Value

If you were to create a database table containing these values per row, and then catch each new row as it is being created - parse it - and append it to the data table, you could easily do any kind of analysis or calculation against the data without having to parse the 25M rows again and again (which is a waste).

Additionally, on the first run, if you were to break the 25M records down into 100,000-record blocks and then run the algorithm 250 times (100,000 x 250 = 25,000,000), you could enjoy all the performance you're describing with no slow-down, because you're chunking up the job.

In other words, consider the following:

  1. Create a database table as follows:

     CREATE TABLE PlayerActions (
         RowID          INT PRIMARY KEY IDENTITY,
         Player_Name    VARCHAR(50) NOT NULL,
         Monetary_Value MONEY NOT NULL
     )
  2. Create an algorithm that breaks your 25M rows down into 100k chunks. Example using LINQ / EF5 as an assumption.

     public void ParseFullDataSet(IEnumerable<String> dataSource)
     {
         var rowCount = dataSource.Count();

         // Number of 100k chunks, rounded up. (The original Floor(...) + 1
         // followed by another increment over-counted the chunks.)
         var setCount = rowCount / 100000;
         if (rowCount % 100000 != 0) setCount++;

         for (int i = 0; i < setCount; i++)
         {
             var set = dataSource.Skip(i * 100000).Take(100000);
             ParseSet(set);
         }
     }

     public void ParseSet(IEnumerable<String> dataSource)
     {
         // Assume here that the method reflects your regex generator.
         String regex = RegexFactory.Generate();

         foreach (String data in dataSource)
         {
             Match match = Regex.Match(data, regex);
             if (match.Success)
             {
                 String playerName = match.Groups[1].Value;

                 // Might want to add error handling here.
                 decimal monetaryValue = Convert.ToDecimal(match.Groups[2].Value);

                 db.PlayerActions.Add(new PlayerAction()
                 {
                     // ID = ..., // Set at DB layer using Auto_Increment
                     Player_Name = playerName,
                     Monetary_Value = monetaryValue
                 });
             }
         }

         // Save once per chunk rather than once per row - calling
         // SaveChanges inside the loop is far slower. If not using Entity
         // Framework, use another method to insert rows into your table.
         db.SaveChanges();
     }
  3. Run the above one time to get all of your pre-existing data loaded up.

  4. Create a hook someplace which allows you to detect the addition of a new row. Every time a new row is created, call:

     ParseSet(new List<String>() { newValue }); 

    or if multiples are created at once, call:

     ParseSet(newValues); // Where newValues is an IEnumerable<String> 

Now you can do whatever computational analysis or data mining you want on the data, without having to worry about on-the-fly performance over 25M rows.

Regex does take time to compute. However, you can make it compact using some tricks. You can also use string functions in C# to avoid the regex functions.

The code would be lengthy but might improve performance. String has several functions to cut and extract characters and do pattern matching as you need, eg: IndexOfAny, LastIndexOf, Contains...

string str = "mon";
string[] str2 = new string[] { "mon", "tue", "wed" };

// Arrays have no IndexOfAny; use Array.IndexOf to test membership.
if (Array.IndexOf(str2, str) >= 0)
{
  //success code//
}
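To make that concrete for the name said (\w*) case from the question, an IndexOf/Substring version might look like this (ExtractSaid is a hypothetical helper; note that char.IsLetterOrDigit plus '_' only approximates \w):

```csharp
using System;

static class NoRegex
{
    // Returns the word following "<userName> said ", or null if absent.
    public static string ExtractSaid(string input, string userName)
    {
        string prefix = userName + " said ";
        int start = input.IndexOf(prefix, StringComparison.Ordinal);
        if (start < 0) return null;
        start += prefix.Length;

        // Scan forward over word characters, mimicking \w*.
        int end = start;
        while (end < input.Length &&
               (char.IsLetterOrDigit(input[end]) || input[end] == '_'))
            end++;

        return input.Substring(start, end - start);
    }
}
```

Whether this actually beats a compiled regex would need measuring, but it avoids the regex engine entirely and allocates nothing beyond the result string.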
