
Regex issue and C#

I am having an issue with a string manipulation method I have written. The purpose of this method is to find link tags within a long string and reformat their hrefs.

To give some context, I am parsing a large number of HTML files that were on a CD and collating the results into XML files used by a website in a separate project (I wrote this as part of a console app). The HTML files contain instructional text with links that are relative to the files on the CD, and I need to change the hrefs to be relative to the website the information is going on.

The following code appears to work fine if there is only one link tag, but pass it two and the output is badly garbled. Strangely, Visual Studio's Regex editor claims that the linkTag regex below matches only the link tags, but when it comes to replacing the links with the correct hrefs, link fragments are inserted at various points within the instructions string.

The reason for the additional alphaDir regex is that I will eventually expand this method to correct links with different starting hrefs. We are talking about parsing thousands of HTML files, but this format is by far the most common.

I am at a bit of a loss on this one. I am very much a regex beginner and wrote all of the regexes below myself, so any thoughts on any of them would be welcome too.

Typical input string

Hold 1st <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back outward
  &amp; fingers forward, and put 2nd <strong><a href="../f/fist_hand.html">FIST</a></strong> hand, back forward
  &amp; fingers inward, with lower knuckle of its 4th finger on
  lower knuckle of 1st thumb; then slide 2nd hand forwards one
  hand's length.

The method

    static string instructions(string instructions)
    {
        Regex Spaces = new Regex(@"\s+|\n|\r");
        Regex linkTag = new Regex(@"<a(.*?)>(.*?)<\/a>");
        Regex linkTagHtml = new Regex(@"<a(.*?)>|<\/a>");
        Regex hrefAttr = new Regex("href=\"(.)*?\"");
        Regex alphaDir = new Regex(@"/([a-z])?/");

        string signName = string.Empty;
        char alphaChar;
        string replacementLinkTag = string.Empty;
        string replacementHref = string.Empty;

        instructions = Spaces.Replace(instructions, " ");

        MatchCollection matches = linkTag.Matches(instructions);

        foreach (Match link in matches)
        {
            Match alphaDirMatch = alphaDir.Match(link.Value.ToString());
            if (alphaDirMatch.Success)
            {
                Match hrefAttrMatch = hrefAttr.Match(link.Value.ToString());
                if (hrefAttrMatch.Success)
                {
                    signName = linkTagHtml.Replace(link.Value.ToString(), string.Empty).ToLower().Trim();
                    signName = signName.Replace(" ", "_");
                    alphaChar = signName[0];

                    replacementHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha=" + alphaChar.ToString() +"&sign=" + signName + "\"";
                    replacementLinkTag = hrefAttr.Replace(link.Value.ToString(), replacementHref);

                    instructions = instructions.Remove(link.Index, link.Length);
                    instructions = instructions.Insert(link.Index, replacementLinkTag);
                }
            }
        }            

        return instructions;
    }

Current output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; finge<a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a>f="../f/fist_hand.html">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.

Desired output string

Hold 1st <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back outward &amp; fingers forward, and put 2nd <strong><a href="/pages/displayc.aspx?c=dictionary&alpha=f&sign=fist">FIST</a></strong> hand, back forward &amp; fingers inward, with lower knuckle of its 4th finger on lower knuckle of 1st thumb; then slide 2nd hand forwards one hand's length.

The solution - thanks for the suggestion, Oded!

I used the HtmlAgilityPack to load the instructions string as HTML, selected the link tags into an HtmlNodeCollection, then looped over each one, read its href value, and made the edits.

For those interested, the code ended up looking like this:

    // Requires: using System.Text.RegularExpressions; using HtmlAgilityPack;
    static string instructions(string instructions)
    {
        // \s already matches \n and \r, so \s+ alone collapses every whitespace run.
        Regex spaces = new Regex(@"\s+");
        Regex alphaDir = new Regex(@"/([a-z])?/");

        instructions = spaces.Replace(instructions, " ");

        HtmlDocument instr = new HtmlDocument();
        instr.LoadHtml(instructions);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection links = instr.DocumentNode.SelectNodes("//a");

        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                string href = link.GetAttributeValue("href", string.Empty);

                // Only rewrite hrefs that contain a single-letter directory such as "/f/".
                if (string.IsNullOrWhiteSpace(href) || !alphaDir.IsMatch(href))
                {
                    continue;
                }

                // Strip the leading "../f/" directory and the trailing ".html" extension.
                string signName = Regex.Replace(href, @".*?/([a-z])?/|\.html$", string.Empty);
                signName = signName.Replace(" ", "_");

                if (signName.Length == 0)
                {
                    continue; // nothing left to build a link from
                }

                string replacementHref = "/pages/displayc.aspx?c=dictionary&alpha=" + signName[0] + "&sign=" + signName;
                link.SetAttributeValue("href", replacementHref);
            }
        }

        return instr.DocumentNode.InnerHtml;
    }

I recommend trying the HTML Agility Pack to parse and query your HTML documents.

Using regex can be rather brittle, and if the documents are not very uniform, it may be an approach that will not work - see this SO answer.
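Separately from the brittleness, the garbled output in the question seems to have a mechanical cause. The Match objects come from the original string, but the loop changes the string's length with Remove and Insert, so the Index of every later match points at the wrong offset. Below is a minimal sketch of one way around that, assuming you stay with regex: let Regex.Replace call a MatchEvaluator, which assembles the result in a single pass. The class name LinkRewriter and method name RewriteLinks are made up for illustration; the patterns are the ones from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRewriter
{
    // Patterns copied from the question. Note that <a(.*?)> would also
    // match other tags starting with "<a", such as <abbr>.
    static readonly Regex LinkTag = new Regex(@"<a(.*?)>(.*?)</a>");
    static readonly Regex LinkTagHtml = new Regex(@"<a(.*?)>|</a>");
    static readonly Regex HrefAttr = new Regex("href=\"(.)*?\"");
    static readonly Regex AlphaDir = new Regex(@"/([a-z])?/");

    // Hypothetical helper: rewrites every matching link in one pass.
    public static string RewriteLinks(string instructions)
    {
        // Regex.Replace builds the new string itself, so no match offset
        // is ever applied to an already modified string.
        return LinkTag.Replace(instructions, link =>
        {
            if (!AlphaDir.IsMatch(link.Value) || !HrefAttr.IsMatch(link.Value))
            {
                return link.Value; // leave other links untouched
            }

            // Link text with the tags stripped, e.g. "FIST" -> "fist".
            string signName = LinkTagHtml.Replace(link.Value, string.Empty)
                .ToLower().Trim().Replace(" ", "_");
            if (signName.Length == 0)
            {
                return link.Value; // no link text to derive a sign name from
            }

            string newHref = "href=\"/pages/displayc.aspx?c=dictionary&alpha="
                + signName[0] + "&sign=" + signName + "\"";

            // Use an evaluator here as well, so a '$' in the sign name is
            // not interpreted as a substitution pattern.
            return HrefAttr.Replace(link.Value, _ => newHref);
        });
    }
}
```

Calling LinkRewriter.RewriteLinks(instructions) after the whitespace normalisation would stand in for the foreach loop. Iterating the matches in reverse order would also avoid the stale offsets, but this does nothing about the wider concern that regex is a fragile way to parse HTML.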

In addition to @Oded's answer, you could do this with a simple XSL transform. Regex, IMO, is not the way to go here.
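A rough sketch of what such a transform might look like from C#, using XslCompiledTransform from System.Xml.Xsl. The class name XslLinkRewriter, the Rewrite method and the wrapping root element are all hypothetical. It assumes each instructions fragment is well-formed XML once wrapped in a root element, which real-world HTML (unclosed tags, entities such as &nbsp;) often is not. The sign name is taken from the file name in the href, so the sample link yields sign=fist_hand.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

static class XslLinkRewriter
{
    // XSLT 1.0: an identity transform plus one template that rebuilds
    // hrefs of the form "../f/fist_hand.html".
    const string Stylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""@*|node()"">
    <xsl:copy><xsl:apply-templates select=""@*|node()""/></xsl:copy>
  </xsl:template>
  <xsl:template match=""a/@href[starts-with(., '../')]"">
    <xsl:variable name=""rest"" select=""substring-after(., '../')""/>
    <xsl:attribute name=""href"">
      <xsl:value-of select=""concat('/pages/displayc.aspx?c=dictionary&amp;alpha=',
        substring($rest, 1, 1), '&amp;sign=',
        substring-before(substring-after($rest, '/'), '.html'))""/>
    </xsl:attribute>
  </xsl:template>
</xsl:stylesheet>";

    // Hypothetical helper: returns the transformed fragment, still wrapped
    // in the <root> element that was added to make it a single document.
    public static string Rewrite(string fragment)
    {
        var xslt = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(styleReader);
        }

        var output = new StringBuilder();
        using (var input = XmlReader.Create(new StringReader("<root>" + fragment + "</root>")))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Because the output is serialised as XML, the ampersands in the rewritten href come out as &amp;amp; entities, which is also what valid HTML attribute values expect.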
