I'm parsing HTML node text with a regex, looking for words to perform operations on.
I'm using (\\w+)
I have situations like word&nbsp;word
and the nbsp gets recognized as a word.
I can match the HTML entity with \\&[a-z0-9A-Z]+\\;
but I don't know how to unmatch a word if it is part of an entity.
Is there a way to have a regex match a word, but not if it is part of an HTML entity like the following?
&lt;
&#60;
&yacute;
&#253;
etc etc
A negative lookbehind assertion might do the trick:
(?<!&#?)\b\w+
matches only if the word is not preceded by & or &#. It doesn't check for a semicolon, though, since a semicolon might legitimately follow a normal word.
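As a rough illustration of the lookbehind above, here is a minimal C# sketch. The class name `LookbehindDemo`, the helper `Words`, and the sample string are made up for this example and are not from the question.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindDemo
{
    // A word that is not immediately preceded by "&" or "&#".
    private static readonly Regex WordNotEntity = new Regex(@"(?<!&#?)\b\w+");

    // Hypothetical helper: returns every match as a plain string.
    public static string[] Words(string text)
    {
        return WordNotEntity.Matches(text)
                            .Cast<Match>()
                            .Select(m => m.Value)
                            .ToArray();
    }

    public static void Main()
    {
        // Assumed sample input: named and numeric entities mixed with normal words.
        string sample = "word&nbsp;word &#253; &lt;tag";
        Console.WriteLine(string.Join(", ", Words(sample)));
        // prints: word, word, tag
    }
}
```

The entity names and numbers (nbsp, 253, lt) are skipped because each directly follows & or &#, while "tag" is still matched even though it directly follows the semicolon of an entity.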
Rather than working around the entities in the regex, first use:
System.Web.HttpUtility.HtmlDecode(...)
or
System.Net.WebUtility.HtmlDecode(...)
on your HTML.
Decoding converts all escaped characters to their normal representation. Parse the decoded HTML with the regex afterwards.
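A minimal sketch of the decode-first approach, assuming the input is the text of a single node. The class name `DecodeFirstDemo`, the helper `Words`, and the sample string are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class DecodeFirstDemo
{
    // Hypothetical helper: decode entities first, then collect \w+ matches.
    public static string[] Words(string nodeText)
    {
        // "&nbsp;" becomes U+00A0, "&lt;" becomes "<", and so on.
        string decoded = WebUtility.HtmlDecode(nodeText);
        return Regex.Matches(decoded, @"\w+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Assumed sample input, not taken from the question.
        Console.WriteLine(string.Join(", ", Words("word&nbsp;word &lt;etc&gt;")));
        // prints: word, word, etc
    }
}
```

One thing to keep in mind with this approach: an entity that decodes to a letter, such as &yacute;, becomes a real letter after decoding, so \w+ will then match it as (part of) a word. Whether that is wanted depends on what the word operations are.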
Since you're using C#, you could go a step further and check for the full
entity form.
This uses a conditional at a word boundary to check for
a following semicolon. If one is there, it uses a negative lookbehind to ensure
the word is not part of an entity.
 # @"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))"

 (?i)
 ( \w+ )                  # (1)
 \b
 (?(?= ; )                # Conditional. Is ';' the next character?
      (?<!                     # Yes, then this word cannot be part of an entity
           (?: & | % )
           (?:
                [a-z]+
             |  (?:
                     \#
                     (?:
                          [0-9]+
                       |  x [0-9a-f]+
                     )
                )
           )
           (?= ; )
      )
 )
Code:
string input = @"
&lt; &#60;
&yacute; &#253;
etc etc
I have situations like word&nbsp;word and the nbsp gets recognized as a word.
";
Regex RxNonEntWords = new Regex(@"(?i)(\w+)\b(?(?=;)(?<!(?:&|%)(?:[a-z]+|(?:\#(?:[0-9]+|x[0-9a-f]+)))(?=;)))");
Match _m = RxNonEntWords.Match( input );
while (_m.Success)
{
Console.WriteLine("Found: {0}", _m.Groups[1].Value);
_m = _m.NextMatch();
}