
Reg Ex for specific number in string

I'd like to match numbers (integer and real) in a string, but not if they are part of an identifier; e.g., I'd like to match 5.5 or 42, but not x5. Strings are roughly of the form "x5*1.1+42*y=40". So far, I have come up with

([0-9]*[.])?[0-9]+[^.*+=<>]

This correctly ignores x0, but it also fails to match 0 or 0.5 (12.45, however, works). Changing the + to * leads to wrong matches.

It would be very nice if someone could point out my error.

Thanks!

This is actually not simple. Float literals are more complex than you assumed: they can contain an e or E for exponential format. Also, you can have prefixed signs (+ or -) for the number and/or the exponent. All in all it can be done like this:

import re
# The lookbehind rejects digits glued to an identifier such as x5 or a123.
re.findall(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?',
           'x5*1.1+42*y=40+a123-3.14e-2')

This returns:

['1.1', '+42', '40', '-3.14e-2']

You should consider, though, whether something like 4-3 should lead to ['4', '3'] or ['4', '-3']. If the input was 4+-3, the '-3' would clearly be preferable. But distinguishing these cases isn't easy with a regex, and you should consider using a proper formula parser for them.
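To see the ambiguity concretely, the pattern from above can be run on a few small inputs. This is only a sketch; the name NUMBER is introduced here for illustration and is not part of the original answer.

```python
import re

# Same pattern as in the answer above; NUMBER is just an illustrative name.
NUMBER = re.compile(r'(?:(?<![a-zA-Z_0-9])|[+-]\s*)[\d.]+(?:[eE][+-]?\d+)?')

for expr in ('4-3', '4+-3', '4+3'):
    # The operator directly in front of a number is always taken as its sign.
    print(expr, NUMBER.findall(expr))
```

The binary minus in 4-3 is reported as a sign, exactly like the unary minus in 4+-3, so the regex alone cannot tell the two apart.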

Maybe the standard module ast can help you. The input must be a valid Python expression in this case, so something like a+b=40 isn't allowed, because what is left of the equals sign is not a valid assignment target (lvalue). But for valid Python expressions you could use ast like this:

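import ast

def find_all_numbers(e):
  # Recurse into both operands of a binary operation.
  if isinstance(e, ast.BinOp):
    yield from find_all_numbers(e.left)
    yield from find_all_numbers(e.right)
  # ast.Num was removed in Python 3.12; numeric literals are ast.Constant nodes.
  elif isinstance(e, ast.Constant) and type(e.value) in (int, float, complex):
    yield e.value

list(find_all_numbers(ast.parse('x5*1.1+42*y-40').body[0].value))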

Returns:

[1.1, 42, 40]
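A parser also resolves the sign question raised above, because a unary minus shows up as its own node. The following is a minimal sketch under that assumption; signed_numbers and numbers_in are hypothetical helper names, and only binary and unary operators are handled.

```python
import ast

def signed_numbers(node):
    """Yield numeric literals left to right, keeping a unary minus on a literal."""
    if isinstance(node, ast.BinOp):
        yield from signed_numbers(node.left)
        yield from signed_numbers(node.right)
    elif isinstance(node, ast.UnaryOp):
        # Only fold the sign in when the minus applies directly to a literal.
        negate = (isinstance(node.op, ast.USub)
                  and isinstance(node.operand, ast.Constant))
        for value in signed_numbers(node.operand):
            yield -value if negate else value
    elif isinstance(node, ast.Constant) and type(node.value) in (int, float):
        yield node.value

def numbers_in(expression):
    # mode='eval' parses a single expression; .body is its root node.
    return list(signed_numbers(ast.parse(expression, mode='eval').body))

print(numbers_in('4-3'))   # binary minus: the 3 stays positive
print(numbers_in('4+-3'))  # unary minus: the 3 becomes -3
```

Other node types (comparisons, function calls, parenthesised sub-expressions under a minus) would need their own branches.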

You could do it with something like

\b\d*(\.\d+)?\b

It matches any number of digits (\d*) followed by an optional decimal part ((\.\d+)?). The \b matches word boundaries, i.e. the location between a word character and a non-word character. And since both digits and (English) letters are word characters, it won't match the 5 in a sequence like x5.

See this regex101 example.
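A quick way to try this in Python is sketched below. Because every part of the pattern is optional, it also produces empty matches at word boundaries, so the sketch filters those out. The second pattern, which requires at least one digit, is a variant assumed here and not part of the answer.

```python
import re

expr = 'x5*1.1+42*y=40'

# Pattern from the answer; empty matches are dropped by the filter.
loose = re.compile(r'\b\d*(\.\d+)?\b')
loose_hits = [m.group(0) for m in loose.finditer(expr) if m.group(0)]
print(loose_hits)

# Variant (an assumption, not from the answer): require at least one digit.
# The group is non-capturing so that findall returns whole matches.
strict = re.compile(r'\b\d+(?:\.\d+)?\b')
strict_hits = strict.findall(expr)
print(strict_hits)
```

Both print ['1.1', '42', '40'] for this input; the 5 in x5 is skipped because there is no word boundary between x and 5.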

The main reason your attempt fails is that it ends with [^.*+=<>], which requires the match to end with one extra character other than ., *, +, =, < or >. When the number ends with a single digit, as in 0 and 0.5, that digit is consumed by [0-9]+, which needs at least one digit. What follows is either an operator that the class excludes or the end of the string, so nothing is left for [^.*+=<>] to match and the whole match fails. With 12.45, the pattern first matches 12.4 and the [^.*+=<>] then matches the 5.
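The behaviour described here can be checked with a few lines of Python. This is a small sketch; the names attempt and results are illustrative only.

```python
import re

# The pattern from the question, unchanged.
attempt = re.compile(r'([0-9]*[.])?[0-9]+[^.*+=<>]')

results = {}
for text in ('0', '0.5', '12.45'):
    m = attempt.search(text)
    # group(0) is the whole match, including the trailing extra character.
    results[text] = m.group(0) if m else None
    print(repr(text), '->', results[text])
```

Only 12.45 matches, and its final 5 is the character consumed by the trailing class rather than by [0-9]+.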

Do something like ((?<![a-zA-Z_])\d+(\.\d+)?)

It uses a negative lookbehind so that nothing immediately preceded by a character from [a-zA-Z_] is matched. Check it out here in Regex101.
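The sketch below tries this pattern in Python, with the groups made non-capturing so that re.findall returns whole matches. One caveat shows up with identifiers that contain several digits, such as a123: the lookbehind only inspects one character, so the tail 23 is still matched. Adding 0-9 to the lookbehind, as the first answer does, is one possible way around that; the stricter pattern here is an assumption, not part of this answer.

```python
import re

expr = 'x5*1.1+42*y=40+a123'

# Pattern from the answer (groups made non-capturing for findall).
answer = re.compile(r'(?<![a-zA-Z_])\d+(?:\.\d+)?')
answer_hits = answer.findall(expr)
print(answer_hits)

# Variant (an assumption): also forbid a preceding digit.
stricter = re.compile(r'(?<![a-zA-Z_0-9])\d+(?:\.\d+)?')
stricter_hits = stricter.findall(expr)
print(stricter_hits)
```

The first list contains '23' from a123, the second does not.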

About your regex ([0-9]*[.])?[0-9]+[^.*+=<>]: use [0-9]+ instead of [0-9]* so that .05 is not captured, only 0.5. As for the [^.*+=<>] part, you could add ? after it so that the trailing character becomes optional. As it stands, 1.1 won't be captured, because ([0-9]*[.])?[0-9]+ is satisfied but the [^.*+=<>] that must follow it is not.
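As a quick check, both suggestions can be applied to the question's pattern and run on the sample string. This is only a sketch; the name relaxed is illustrative, and the group is made non-capturing so that re.findall returns whole matches.

```python
import re

expr = 'x5*1.1+42*y=40'

# Question's pattern with [0-9]+ in the optional group and a trailing ?
# on the character class.
relaxed = re.compile(r'(?:[0-9]+[.])?[0-9]+[^.*+=<>]?')
relaxed_hits = relaxed.findall(expr)
print(relaxed_hits)
```

With these changes 1.1 and 40 are found, but so is the 5 of x5, so the pattern would still need to be combined with something like the negative lookbehind above.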
