
.NET Regex search and string.Replace

My XML file is around 7 MB. I have to remove some invalid characters from some of its nodes. There are many such nodes, for example "title", "country" and so on.

The "title" node alone has 31,000 matches, and processing them takes more than 35 minutes, which is not acceptable for my project requirements. How can I optimise this?

Method call:

  fileText = RemoveInvalidCharacters(fileText, "title", @"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", "$1");  

Method definition:

private static string RemoveInvalidCharacters(string fileText, string nodeName, string regexPattern, string regexReplacement)
        {
            foreach (Match match in Regex.Matches(fileText, @"<" + nodeName + ">(.*)</" + nodeName + ">"))
            {
                var oldValue = match.Groups[0].Value;
                var newValue = "<" + nodeName + ">" + Regex.Replace(match.Groups[1].Value, regexPattern, regexReplacement) +
                               "</" + nodeName + ">";
                fileText = fileText.Replace(oldValue, newValue);
            }

            return fileText;
        }

Instead of using Regex to parse the XML document, you can use the tools in the System.Xml.Linq namespace to handle the parsing for you, which should be much faster and easier to use.

Here's an example program that builds a structure with 35,000 nodes. I've kept your regex pattern to check for the bad characters, but I've created it with RegexOptions.Compiled, which should yield better performance, although admittedly I didn't see a huge increase when I compared the two.

This example uses Descendants, which returns every element with the name you pass in that sits anywhere beneath the element it is called on (in this case, we start from the root of the document). Those results are filtered by the ContainsBadCharacters method.
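To make the Descendants-plus-filter step concrete, here is a minimal sketch on a two-node document. The class name DescendantsDemo, the method TitlesWithTilde and the tilde check are made up for illustration; only XDocument.Parse, Descendants and the LINQ operators come from the framework.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

static class DescendantsDemo
{
    // Hypothetical helper: returns the text of every <Title> element,
    // at any depth, whose value contains a tilde.
    public static string[] TitlesWithTilde(string xml)
    {
        var doc = XDocument.Parse(xml);
        return doc.Descendants("Title")               // all <Title> elements under the root
                  .Where(e => e.Value.Contains("~"))  // keep only the ones that need attention
                  .Select(e => e.Value)
                  .ToArray();
    }

    static void Main()
    {
        const string xml =
            "<Nodes>" +
            "<Node><Title>Lorem~</Title><Country>Ipsum</Country></Node>" +
            "<Node><Title>Dolor</Title><Country>Sit~</Country></Node>" +
            "</Nodes>";

        foreach (var title in TitlesWithTilde(xml))
            Console.WriteLine(title); // prints "Lorem~"
    }
}
```

Note that the Country element containing a tilde is not returned, because Descendants("Title") only looks at elements with that name.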

For the sake of simplicity I haven't made the foreach loops DRY, but it's probably worth doing so.
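One possible way to fold the repeated loops into a single helper is sketched below. CleanerSketch and CleanElements are hypothetical names, not part of the example program, and the ToList() call is an extra precaution I've assumed is worthwhile so that the query is fully evaluated before any values are changed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

static class CleanerSketch
{
    // Same pattern as in the question, compiled once and reused.
    static readonly Regex Invalid =
        new Regex(@"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", RegexOptions.Compiled);

    // Hypothetical helper: cleans every element whose name is listed in elementNames.
    public static void CleanElements(XDocument doc, params string[] elementNames)
    {
        foreach (var name in elementNames)
        {
            // Materialise the matching elements first, then modify them.
            var dirty = doc.Descendants(name)
                           .Where(e => Invalid.IsMatch(e.Value))
                           .ToList();

            foreach (var element in dirty)
                element.Value = Invalid.Replace(element.Value, "$1");
        }
    }

    static void Main()
    {
        var doc = XDocument.Parse(
            "<Nodes><Node><Title>Lorem~~~~</Title><Country>Ipsum!</Country></Node></Nodes>");

        CleanElements(doc, "Title", "Country");

        Console.WriteLine(doc.Root.Element("Node").Element("Title").Value);   // prints "Lorem"
        Console.WriteLine(doc.Root.Element("Node").Element("Country").Value); // prints "Ipsum"
    }
}
```

Adding another node name then only means passing one more argument to CleanElements.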

On my machine this runs in less than a second, but timings will vary with machine performance and the number of bad characters in the file.

using System;
using System.IO;
using System.Linq;
using System.Reflection;
using System.Text;
using System.Text.RegularExpressions;
using System.Xml.Linq;

namespace ConsoleApplication2
{
    class Program
    {
        static Regex r = new Regex(@"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", RegexOptions.Compiled);

        static void Main(string[] args)
        {
            System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
            var xmls = new StringBuilder("<Nodes>");
            for(int i = 0;i<35000;i++)
            {
                xmls.Append(@"<Node>
                                  <Title>Lorem~~~~</Title>
                                  <Country>Ipsum!</Country>
                               </Node>");
            }
            xmls.Append("</Nodes>");

            var doc = XDocument.Parse(xmls.ToString());

            sw.Start();
            foreach(var element in doc.Descendants("Title").Where(ContainsBadCharacters))
            {               
                element.Value = r.Replace(element.Value, "$1");
            }
            foreach (var element in doc.Descendants("Country").Where(ContainsBadCharacters))
            {
                element.Value = r.Replace(element.Value, "$1");
            }
            sw.Stop();

            // doc.Save creates (or overwrites) the file itself. Calling FileInfo.Create() first
            // would leave an open FileStream behind, which can make Save fail with a sharing violation.
            var savePath = Path.Combine(
                Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "test.txt");

            doc.Save(savePath);
            Console.WriteLine(sw.Elapsed);
            Console.Read();
        }

        static bool ContainsBadCharacters(XElement item)
        {
            return r.IsMatch(item.Value);
        }
    }
}
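If the file has to stay a plain string (for instance because it cannot be loaded as XML), a different approach from the one above may still help. The original method calls fileText.Replace once per match, and each call scans and copies the whole 7 MB string, so 31,000 matches mean 31,000 full passes. The sketch below is an assumption about how that could be reduced to a single pass with Regex.Replace and a MatchEvaluator; it is not the method used in the example program. It also changes (.*) to the lazy (.*?) and escapes the node name, which differs slightly from the original pattern.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglePassSketch
{
    // Same signature as the method in the question, but the input is rewritten in one pass.
    public static string RemoveInvalidCharacters(string fileText, string nodeName,
                                                 string regexPattern, string regexReplacement)
    {
        var name = Regex.Escape(nodeName);

        // Lazy (.*?) stops at the first closing tag, so each match covers one element.
        var element = new Regex("<" + name + ">(.*?)</" + name + ">");
        var invalid = new Regex(regexPattern);

        // The lambda is called once per element; Regex.Replace assembles the result itself,
        // so the large string is never searched again with string.Replace.
        return element.Replace(fileText, m =>
            "<" + nodeName + ">" +
            invalid.Replace(m.Groups[1].Value, regexReplacement) +
            "</" + nodeName + ">");
    }

    static void Main()
    {
        var cleaned = RemoveInvalidCharacters(
            "<title>Lorem~~</title><title>A&#x41;!</title>", "title",
            @"(&#[xX]?[A-Fa-f\d]+;)|[^\w\s\/\;\&\.@-]", "$1");

        Console.WriteLine(cleaned); // prints "<title>Lorem</title><title>A&#x41;</title>"
    }
}
```

As in the original pattern, the dot does not match line breaks, so elements whose content spans several lines are left untouched.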
