Regular expression to match numbers, but not HTML entities

Question

Is there a regex to find all the digit sequences (\d+) in text, but not the ones forming HTML entities? It looks like I should use "look ahead" and "look behind" together, but I can't figure out how.

For example, for the string &#10001; #555 foo 777; I want to match only 555 and 777, but not 10001.

I've tried

~(?<!(&#)|\d])\d+(?![\d|;])~

But it seems to be too strict, as it returns no matches for cases like 777;

Answer 1

You can probably use this regex with lookarounds:

(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)

Demo: http://www.rubular.com/r/IUGqDf7Nfg
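As a quick sanity check, here is a minimal Python sketch that runs the pattern above against the sample string from the question. The choice of Python's re module is an assumption made for illustration (the demo link uses Rubular, a Ruby tester); the pattern itself is copied verbatim.

```python
import re

# Pattern from the answer above, copied verbatim.
PATTERN = re.compile(r"(?<!&#)\b\d+\b|(?:^|\b)\d+\b(?!;|$)")

# Sample string from the question.
sample = "&#10001; #555 foo 777;"

# No capturing groups are used, so findall returns the full matches.
matches = PATTERN.findall(sample)
print(matches)  # ['555', '777']
```

The first alternative rejects 10001 because it is preceded by &#, and the second rejects it because it is followed by a semicolon, so only 555 and 777 are returned.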

Answer 2

I found the solution the next morning.

(?<![(&#)\d])\d+|\d+(?!\d|;)

It's fairly long and hard to read, but it works. P.S. I think it's a lot easier to just decode or hide the entities before processing and then put them back.
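The two ideas in this answer can be compared side by side. The following is only a sketch in Python: find_numbers, ENTITY and ANSWER_PATTERN are hypothetical names, and the definition of a numeric entity as &#digits; or &#xhex; is an assumption. Instead of hiding the entities and putting them back, it blanks each one out with spaces of the same length, so the offsets of everything else stay unchanged.

```python
import re

# Pattern from the answer above, copied verbatim.
ANSWER_PATTERN = re.compile(r"(?<![(&#)\d])\d+|\d+(?!\d|;)")

# Assumed definition of a numeric entity: &#<decimal>; or &#x<hex>;
ENTITY = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def find_numbers(text):
    """Return digit runs in text, skipping those inside numeric entities."""
    # Replace each entity with spaces of equal length, so every other
    # character keeps its original offset and nothing has to be restored.
    masked = ENTITY.sub(lambda m: " " * len(m.group()), text)
    return re.findall(r"\d+", masked)


sample = "&#10001; #555 foo 777;"
print(ANSWER_PATTERN.findall(sample))  # ['555', '777']
print(find_numbers(sample))            # ['555', '777']

# A number that merely sits between '#' and ';' is not an entity.
edge = "#42;"
print(ANSWER_PATTERN.findall(edge))    # []
print(find_numbers(edge))              # ['42']
```

Both approaches agree on the sample from the question. On an input like #42; the lookaround pattern returns nothing, because the first alternative rejects the leading # and the second rejects the trailing semicolon, while the masking version still finds 42. Note also that inside the character class [(&#)\d] the parentheses are literal characters, not a group.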

2 answers

Answer 1
0 2013-11-22 13:41:52

Answer 2
0 ACCEPTED 2013-11-23 11:34:52
