
How to extract a specific type of number from a string using regex?

Consider this string:

text = '''
4 500,5

12%

1,63%

568768,74832 days in between

34 cars in a row'''

As you can see, there are plain numbers, numbers with a space inside, numbers with a comma, and numbers with both. So 4 500,5 counts as a single, standalone number. Extracting the numbers with commas and spaces is easy; the pattern I found is:

pattern = re.compile(r'(\d+ )?\d+,\d+')

However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...] but these options do not allow me to exclude the unwanted parts of other numbers.

((?:\d+ )?\d+,\d+)|(\d+(?! \d))

I believe this will do what you want (Regexr link: https://regexr.com/695tc )

To capture "simple" numbers, it looks for [one or more digits], which are not followed by [a space and another digit].

I edited so that you can use capture groups appropriately, if desired.
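As a minimal sketch of how the two capture groups might be used from Python (the variable names are my own, and the sample text is the one from the question): group 1 holds the numbers that contain a comma, group 2 holds the simple ones.

```python
import re

text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'

# Group 1: optional "digits + space" prefix, then digits,digits.
# Group 2: digits that are not followed by a space and another digit.
pattern = re.compile(r'((?:\d+ )?\d+,\d+)|(\d+(?! \d))')

with_comma = []
simple = []
for m in pattern.finditer(text):
    if m.group(1) is not None:
        with_comma.append(m.group(1))
    else:
        simple.append(m.group(2))

print(with_comma)  # ['4 500,5', '1,63', '568768,74832']
print(simple)      # ['12', '34']
```

One caveat: the second alternative only rejects a following space plus digit, so an input such as 4 500 (no comma) would still yield 500 as a "simple" number.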

If you only want to match 12 and 34:

(?<!\S)\d+\b(?![^\S\n]*[,\d])
  • (?<!\S) Assert a whitespace boundary to the left
  • \d+\b Match 1+ digits and a word boundary
  • (?! Negative lookahead, assert what is directly to the right is not
    • [^\S\n]*[,\d] Match optional spaces and either , or a digit
  • ) Close lookahead
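A short sketch of this pattern applied with Python's re module, assuming the sample text from the question (the variable names are illustrative only):

```python
import re

text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'

# Digits with whitespace (or start of string) on the left that are not
# followed by optional same-line spaces and then a comma or a digit.
simple_rx = re.compile(r'(?<!\S)\d+\b(?![^\S\n]*[,\d])')

simple = simple_rx.findall(text)
print(simple)  # ['12', '34']
```

Because [^\S\n] excludes the newline, a number at the end of one line is not rejected just because the next line starts with a digit.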


I'd suggest extracting all the numbers first, then splitting them into two lists: those with a comma (floats) and those without a comma (integers):

import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
# Match every number: optional space-separated thousands groups, optional ",digits" part
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
# The matches are still strings; a comma marks the float-like ones
print('Float: ', [x for x in number_list if ',' in x])
# => Float:  ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers:  ['12', '34']


The regex matches:

  • (?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
  • (?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
    • \d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space / hard (non-breaking) space followed by three digits
    • | - or
    • \d+ - one or more digits
  • (?:,\d+)? - an optional sequence of , and then one or more digits
  • (?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.
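The matches above are still strings. If actual numeric values are needed, one possible follow-up step is sketched below; to_number is a hypothetical helper of my own, and it assumes the comma is always the decimal separator and that spaces only ever separate thousands groups.

```python
import re

text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'

def to_number(s):
    """Hypothetical helper: turn a matched string into an int or a float."""
    # Drop regular and non-breaking spaces used as thousands separators.
    cleaned = s.replace(' ', '').replace('\xa0', '')
    if ',' in cleaned:
        # Assumption: the comma is the decimal separator.
        return float(cleaned.replace(',', '.'))
    return int(cleaned)

numbers = [to_number(x) for x in re.findall(number_rx, text)]
floats = [n for n in numbers if isinstance(n, float)]
ints = [n for n in numbers if isinstance(n, int)]

print(floats)  # [4500.5, 1.63, 568768.74832]
print(ints)    # [12, 34]
```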
