Use C# and Regex to find and surround all words and numbers in some HTML text with spans

Question

I need to surround every word in loaded HTML text with a span that uniquely identifies that word. The problem is that some content is not being handled by my regex pattern. My current problems include:

1) Special HTML characters like &rdquo; &ldquo; are treated as words.

2) Currency values, e.g. $2,500 ends up as "2" "500" (I need "$2,500").

3) Double-hyphenated words, e.g. one-legged-man ends up as "one-legged" "man".

I'm new to regular expressions and, after looking at various other posts, have derived the following pattern, which seems to work for everything except the cases above. What I have so far is:

string pattern = @"(?<!<[^>]*?)\b('\w+)|(\w+['-]\w+)|(\w+')|(\w+)\b(?![^<]*?>)";
string newText = Regex.Replace(oldText, pattern, delegate(Match m) {
                  wordCnt++;
                  return "<span data-wordno='" + wordCnt.ToString() + "'>" + m.Value + "</span>";
 });

How can I fix or extend the above pattern to cater for these problems, or should I be using a different approach altogether?

Answer 1

A fundamental problem that you're up against here is that HTML is not a "regular language". This means that HTML is complex enough that, whatever regular expression you write, someone can always come up with valid HTML that it fails to handle. It isn't a matter of writing a better regular expression; this is a problem that regex can't solve.

What you need is a dedicated HTML parser. You could try the HtmlAgilityPack NuGet package; there are many others, but it is quite popular.

Edit: Below is an example program using HtmlAgilityPack. When an HTML document is parsed, the result is a tree (aka the DOM). In the DOM, text is stored inside text nodes. So something like <p>Hello World</p> is parsed into a node that represents the p tag, with a child text node holding "Hello World". What you want to do is find all the text nodes in your document and then, for each node, split the text into words and surround each word with a span.

You can search for all the text nodes using an XPath query. The XPath I have below is /html/body//*[not(self::script)]/text(), which avoids the HTML head and any script tags in the body.

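using System;
using System.Linq;
using HtmlAgilityPack;
using static System.Console;

class Program
{
    static void Main(string[] args)
    {
        var doc = new HtmlDocument();
        doc.Load(args[0]);
        var wordCount = 0;
        // Text nodes under <body>, skipping those directly inside <script>.
        var nodes = doc.DocumentNode
                       .SelectNodes("/html/body//*[not(self::script)]/text()");
        // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
        foreach (var node in nodes ?? Enumerable.Empty<HtmlNode>())
        {
            var words = node.InnerHtml.Split(' ');
            var surroundedWords = words.Select(word =>
            {
                if (String.IsNullOrWhiteSpace(word))
                {
                    return word;
                }
                else
                {
                    return $"<span data-wordno=\"{wordCount++}\">{word}</span>";
                }
            });
            // Rejoin with the space that Split removed, so words stay separated.
            var newInnerHtml = String.Join(" ", surroundedWords);
            node.InnerHtml = newInnerHtml;
        }

        WriteLine(doc.DocumentNode.InnerHtml);
    }
}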

Answer 2

Fix 1) by adding "negative look-behind assertions" (?<!\&). I believe they are needed at the beginning of the 1st, 3rd, and 4th alternatives in the original pattern above.

Fix 2) by adding a new alternative |(\$?(\d+[,.])+\d+) at the end of the pattern. This also handles non-dollar and decimal-pointed numbers at the same time.

Fix 3) by enhancing the (\w+['-]\w+) alternative to read ((\w+['-])+\w+) instead.
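
One way the three fixes might be combined is sketched below. It is an illustration under assumptions, not a tested drop-in: the names WordWrapper, TokenPattern and Wrap are hypothetical, and the input is assumed to be tag-free text (for example the content of a single text node), so the tag-skipping lookarounds from the question are left out. It also departs from the fixes as written in two ways. The number alternative is placed first, because .NET tries alternatives left to right and an earlier \w+ would otherwise claim the 2 in 2,500. A (?<!\w) check is added so that a match cannot start in the middle of an entity name such as rdquo.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the class, constant and method names are illustrative only.
static class WordWrapper
{
    const string TokenPattern =
        @"(?<!&#?)(?<!\w)" +           // do not start right after & or &#, or in the middle of a word
        @"(?:\$?\d+(?:[,.]\d+)*\b" +   // numbers and currency: 42, 2,500, $2,500, 3.14
        @"|\w+(?:['-]\w+)*)";          // words, with any number of internal ' or - joins

    // Wraps every token in a numbered span and advances wordCount by the number of tokens found.
    public static string Wrap(string text, ref int wordCount)
    {
        int count = wordCount;         // lambdas cannot capture ref parameters, so use a local copy
        string result = Regex.Replace(text, TokenPattern, m =>
        {
            count++;
            return "<span data-wordno='" + count + "'>" + m.Value + "</span>";
        });
        wordCount = count;
        return result;
    }
}
```

For example, with int n = 0, calling WordWrapper.Wrap("&ldquo;The one-legged-man paid $2,500&rdquo;", ref n) should leave both entities untouched and wrap four tokens: The, one-legged-man, paid and $2,500. Leading and trailing apostrophes (as in 'tis or dogs') are not kept inside the span in this sketch.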
