
Regular expression to match numbers, but not HTML entities

Is there a regex to find all the digit sequences (\d+) in text, but not the ones that form HTML entities? It looks like I should use both lookahead and lookbehind together, but I can't figure out how.

For example, for the string &#10001; #555 foo 777; I want to match only 555 and 777, but not 10001.

I've tried

~(?<!(&#)|\d])\d+(?![\d|;])~

But it seems to be too strict, as it returns no matches for cases like 777; where a number that is not part of an entity is followed by a semicolon.

You can probably use this regex with lookarounds:

(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)

Demo: http://www.rubular.com/r/IUGqDf7Nfg
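For readers who want to try the pattern outside Rubular, here is a minimal Python sketch. The pattern is the one quoted above; the function name find_numbers and the sample string are only illustrative, and the check covers decimal entities such as &#10001; only.

```python
import re

# Pattern quoted from the answer above. Python's re module accepts it
# because the lookbehind (?<!&#) has a fixed width of two characters.
PATTERN = re.compile(r"(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)")


def find_numbers(text):
    """Return digit runs that are not the body of a decimal entity.

    find_numbers is a hypothetical helper name, not part of the answer.
    """
    return PATTERN.findall(text)


if __name__ == "__main__":
    # Sample string from the question.
    print(find_numbers("&#10001; #555 foo 777;"))  # ['555', '777']
```

The first alternative accepts any whole number not directly preceded by &#, which is what lets 777; through; the second alternative covers numbers that are not followed by a semicolon.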

I found the solution the next morning.

(?<![(&#)\d])\d+|\d+(?!\d|;)

It's quite big and hard to read, but it works. PS: I think it's a lot easier just to decode or hide the entities before processing and then put them back.
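The two ideas in this answer can be compared with a short Python sketch. The lookaround pattern is the one quoted above; the masking variant is only one possible reading of "hide the entities", and the names ENTITY, numbers_by_lookaround and numbers_by_masking are hypothetical. The masking step assumes numeric entities only (decimal, plus hexadecimal such as &#x2711;) and replaces each with spaces of the same length, so match offsets still line up with the original text and nothing has to be put back.

```python
import re

# Lookaround pattern quoted from the answer above.
LOOKAROUND = re.compile(r"(?<![(&#)\d])\d+|\d+(?!\d|;)")

# Assumption: only numeric character references need hiding.
ENTITY = re.compile(r"&#(?:\d+|[xX][0-9A-Fa-f]+);")


def numbers_by_lookaround(text):
    """Find digit runs with the single lookaround pattern."""
    return LOOKAROUND.findall(text)


def numbers_by_masking(text):
    """Blank out entities first, then collect plain digit runs."""
    # Same-length padding keeps character positions unchanged.
    masked = ENTITY.sub(lambda m: " " * len(m.group()), text)
    return re.findall(r"\d+", masked)


if __name__ == "__main__":
    sample = "&#10001; #555 foo 777;"  # sample string from the question
    print(numbers_by_lookaround(sample))  # ['555', '777']
    print(numbers_by_masking(sample))     # ['555', '777']
```

The two approaches agree on the question's sample but not everywhere: for a string like #777; (a # before and a ; after, with no &), the lookaround pattern finds nothing, while the masking version still returns 777 because it only hides complete entities.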
