Regex to match words but not html entities

Question

I'm parsing HTML node text with a regex, looking for words to perform operations on.
I'm using (\\w+)

I have situations like word&nbsp;word, where the nbsp gets recognized as a word.

I can match the HTML entity with \\&[a-z0-9A-Z]+\\; but I don't know how to exclude a word if it is part of an entity.

Is there a way to have a regex match a word but not if it is an HTML entity like the following?

&nbsp;
&lt; <
&#253; ý
etc etc

Answer 1

A negative lookbehind assertion might do the trick:

(?<!&#?)\b\w+

matches only if the word is not preceded by & or &#. It doesn't check for a semicolon, though, since that might legitimately follow a normal word.
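
As a minimal sketch of how this pattern could be used from C#, the snippet below collects every match into a list. The class name LookbehindDemo and the helper FindWords are hypothetical names chosen for this illustration; only the pattern itself comes from the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // Pattern from the answer: a word that is not directly preceded by & or &#
    private static readonly Regex NonEntityWord = new Regex(@"(?<!&#?)\b\w+");

    // FindWords is a hypothetical helper name used only for this illustration
    public static List<string> FindWords(string text)
    {
        var words = new List<string>();
        foreach (Match m in NonEntityWord.Matches(text))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        // The entity bodies nbsp, lt and 253 are skipped; word, word and etc are found
        foreach (string w in FindWords("word&nbsp;word &lt; &#253; etc"))
        {
            Console.WriteLine("Found: " + w);
        }
    }
}
```

One consequence of not checking for the semicolon: in plain text such as AT&T, the T after the ampersand is skipped as well, because it is preceded by &.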

Answer 2

Instead, first call

System.Web.HttpUtility.HtmlDecode(...)

or

System.Net.WebUtility.HtmlDecode(...)

on your HTML.

Decoding converts every entity to the character it represents. Run the regex on the decoded text afterwards.
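
A rough sketch of the decode-first approach, assuming the node text is available as a plain string. The class name DecodeDemo and the helper FindWords are hypothetical; WebUtility.HtmlDecode and the (\w+) pattern are the ones mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodeDemo
{
    // The plain word pattern from the question
    private static readonly Regex Word = new Regex(@"\w+");

    // FindWords is a hypothetical helper name used only for this illustration
    public static List<string> FindWords(string html)
    {
        // Turn entities into the characters they stand for, e.g. &nbsp; becomes U+00A0
        string decoded = WebUtility.HtmlDecode(html);

        var words = new List<string>();
        foreach (Match m in Word.Matches(decoded))
        {
            words.Add(m.Value);
        }
        return words;
    }

    public static void Main()
    {
        foreach (string w in FindWords("word&nbsp;word &lt; &#253; etc"))
        {
            Console.WriteLine("Found: " + w);
        }
    }
}
```

With this approach, an entity that decodes to a letter is itself matched: &#253; becomes ý, which \w accepts, so the sample input yields word, word, ý and etc. Whether that is desirable depends on what the words are used for.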

Answer 3

Since you're using C#, you could go a step further and check for the full
entity form.

This uses a conditional after the word boundary to check for
a following semicolon. If one is there, a negative lookbehind ensures
that the word is not part of an entity.

 # @"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))"

 (?i)
 ( \w+ )                       # (1)
 \b 
 (?(?= ; )                     # Conditional. Is ';' the next character ? 
      (?<!                          # Yes, then this word cannot be part of an entity
           (?: & | % )
           (?:
                [a-z]+ 
             |  (?:
                     \#
                     (?:
                          [0-9]+ 
                       |  x [0-9a-f]+ 
                     )
                )
           )
           (?= ; )
      )
 )

Code:

using System;
using System.Text.RegularExpressions;

string input = @"
&nbsp;
&lt; <
&#253; ý
etc etc
I have situations like word&nbsp;word and the nbsp gets recognized as a word.
";

// Match every word that is not the body of an entity such as &nbsp; or &#253;
Regex RxNonEntWords = new Regex(@"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))");
Match _m = RxNonEntWords.Match( input );
while (_m.Success)
{
    // Group 1 holds the matched word
    Console.WriteLine("Found: {0}", _m.Groups[1].Value);
    _m = _m.NextMatch();
}

Answer 1 (score 4, accepted, 2015-07-07 20:48:19)
Answer 2 (score 1, 2015-07-07 21:02:04)
Answer 3 (score 1)
