Is there a regex to find all the digit sequences (\d+) in text, but not the ones forming HTML entities? It looks like I should use both lookahead and lookbehind together, but I can't figure out how.
For example, for the string
&#10001; #555 foo 777;
I want to match only 555 and 777, but not 10001.
I've tried
~(?<!(&#)|\d])\d+(?![\d|;])~
But it seems to be too strict, as it returns no matches for cases like 777;
You can probably use this regex with lookarounds:
(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)
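A quick way to check this pattern against the string from the question is sketched below. Python is used only for illustration (the ~ delimiters in the question suggest PHP/PCRE, which is an assumption), and find_numbers_lookaround is a hypothetical helper name, not something from the thread.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(r"(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)")

def find_numbers_lookaround(text):
    """Return every digit run the pattern matches, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return PATTERN.findall(text)

sample = "&#10001; #555 foo 777;"
print(find_numbers_lookaround(sample))  # ['555', '777']
```

The first alternative rejects digits that directly follow &#, and the word boundaries stop the engine from matching a tail of the entity's digits such as 0001.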
I found the solution the next morning:
(?<![(&#)\d])\d+|\d+(?!\d|;)
It's quite big and hard to read, but it works. P.S. I think it's a lot easier just to decode/hide the entities before processing and then put them back.
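The hide-and-restore idea could look roughly like the sketch below. It is in Python for illustration only and handles decimal entities such as &#10001; and nothing else. The placeholder character and both function names are hypothetical, and the placeholder is assumed never to occur in the input.

```python
import re

# Decimal numeric entities only, e.g. "&#10001;" (an assumption; hex or named
# entities would need a broader pattern).
ENTITY = re.compile(r"&#\d+;")
# Hypothetical placeholder, assumed never to appear in the input text.
PLACEHOLDER = "\x00"

def find_numbers_masked(text):
    """Hide entities, then collect the remaining digit runs."""
    return re.findall(r"\d+", ENTITY.sub(PLACEHOLDER, text))

def replace_numbers_masked(text, func):
    """Apply func to each digit run outside entities, then restore the entities."""
    entities = ENTITY.findall(text)
    hidden = ENTITY.sub(PLACEHOLDER, text)
    processed = re.sub(r"\d+", lambda m: func(m.group()), hidden)
    parts = processed.split(PLACEHOLDER)
    # Re-interleave the saved entities with the processed text between them.
    restored = parts[0]
    for entity, part in zip(entities, parts[1:]):
        restored += entity + part
    return restored

sample = "&#10001; #555 foo 777;"
print(find_numbers_masked(sample))                         # ['555', '777']
print(replace_numbers_masked(sample, lambda n: f"[{n}]"))  # &#10001; #[555] foo [777];
```

With the entities masked, the digit pattern itself stays a plain \d+ and needs no lookarounds.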