Consider this string:
text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''
As you can see, there are simple numbers, numbers with spaces in between, numbers with commas, and numbers with both. Thus, 4 500,5 is considered a standalone, separate number. Extracting the numbers with commas and spaces is easy, and I found this pattern:
pattern = re.compile(r'(\d+ )?\d+,\d+')
However, I am struggling to extract just the simple numbers like 12 and 34. I tried using (?!...) and [^...], but these options do not allow me to exclude the unwanted parts of other numbers.
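To make the problem concrete, here is a minimal sketch of what the two obvious attempts return on the sample string. The names decimal_rx, decimals and pieces are illustrative only, and the optional group is written as non-capturing here (an assumption, not part of the original pattern) so that re.findall returns whole matches rather than the group contents.

```python
import re

text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''

# Same idea as the pattern above, but with a non-capturing optional group,
# so findall yields the full match instead of only the "4 " prefix.
decimal_rx = re.compile(r'(?:\d+ )?\d+,\d+')
decimals = decimal_rx.findall(text)
print(decimals)  # ['4 500,5', '1,63', '568768,74832']

# A bare \d+ illustrates the difficulty: it also returns fragments
# of the numbers that contain spaces or commas.
pieces = re.findall(r'\d+', text)
print(pieces)  # ['4', '500', '5', '12', '1', '63', '568768', '74832', '34']
```

Only 12 and 34 in the second list are the wanted "simple" numbers; everything else is a piece of a larger number.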
((?:\d+ )?\d+,\d+)|(\d+(?! \d))
I believe this will do what you want (Regexr link: https://regexr.com/695tc).
To capture "simple" numbers, it looks for [one or more digits] that are not followed by [a space and another digit].
I edited the pattern so that you can use the capture groups, if desired: group 1 holds the numbers with a comma, and group 2 holds the simple numbers.
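As a rough sketch of how the two capture groups could be used from Python (the list names with_comma and simple are made up for illustration), re.finditer makes it easy to check which alternative matched:

```python
import re

text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''

pattern = re.compile(r'((?:\d+ )?\d+,\d+)|(\d+(?! \d))')

with_comma = []  # matches from group 1 (numbers containing a comma)
simple = []      # matches from group 2 (plain integers)
for m in pattern.finditer(text):
    if m.group(1) is not None:
        with_comma.append(m.group(1))
    else:
        simple.append(m.group(2))

print(with_comma)  # ['4 500,5', '1,63', '568768,74832']
print(simple)      # ['12', '34']
```

The first alternative consumes the comma-separated numbers before the second alternative gets a chance to match their digits, which is why the order of the alternatives matters.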
If you only want to match 12 and 34:
(?<!\S)\d+\b(?![^\S\n]*[,\d])
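A minimal sketch of applying this pattern with Python's re module, assuming the sample string from the question (simple_rx and simple are illustrative names):

```python
import re

text = '''
4 500,5
12%
1,63%
568768,74832 days in between
34 cars in a row'''

# Digits that start after whitespace (or at the start of the string) and are
# not followed by optional horizontal whitespace plus a comma or a digit.
simple_rx = re.compile(r'(?<!\S)\d+\b(?![^\S\n]*[,\d])')
simple = simple_rx.findall(text)
print(simple)  # ['12', '34']
```

Since the pattern has no capture groups, re.findall returns the full matches directly.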
The pattern matches:
(?<!\S) - Assert a whitespace boundary to the left
\d+\b - Match 1+ digits and a word boundary
(?! - Negative lookahead, assert what is directly to the right is not
[^\S\n]*[,\d] - Match optional spaces and either , or a digit
) - Close lookahead

I'd suggest extracting all numbers first, then filter those with a comma to a list with floats, and those without a comma into a list of integers:
import re
text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'
number_list = re.findall(number_rx, text)
print('Float: ', [x for x in number_list if ',' in x])
# => Float: ['4 500,5', '1,63', '568768,74832']
print('Integers: ', [x for x in number_list if ',' not in x])
# => Integers: ['12', '34']
The regex matches:
(?<!\d) - a negative lookbehind that allows no digit immediately to the left of the current location
(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+) - either of the two alternatives:
  \d{1,3}(?:[ \xA0]\d{3})* - one, two or three digits, and then zero or more occurrences of a space / hard (non-breaking) space followed by three digits
  | - or
  \d+ - one or more digits
(?:,\d+)? - an optional sequence of , and then one or more digits
(?!\d) - a negative lookahead that allows no digit immediately to the right of the current location.
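The lists printed above still contain strings. If actual numeric values are needed, one possible follow-up step is sketched below. The helper to_number is hypothetical (not part of the original answer), and it assumes the comma is a decimal separator and that spaces are only digit-grouping characters.

```python
import re

text = '4 500,5\n\n12%\n\n1,63%\n\n568768,74832 days in between\n\n34 cars in a row'
number_rx = r'(?<!\d)(?:\d{1,3}(?:[ \xA0]\d{3})*|\d+)(?:,\d+)?(?!\d)'

def to_number(token):
    """Hypothetical helper: convert a matched token to an int or a float."""
    # Drop grouping spaces (regular and non-breaking).
    cleaned = token.replace(' ', '').replace('\xa0', '')
    if ',' in cleaned:
        # Assumption: the comma is the decimal separator.
        return float(cleaned.replace(',', '.'))
    return int(cleaned)

numbers = [to_number(tok) for tok in re.findall(number_rx, text)]
floats = [n for n in numbers if isinstance(n, float)]
integers = [n for n in numbers if isinstance(n, int)]
print(floats)    # [4500.5, 1.63, 568768.74832]
print(integers)  # [12, 34]
```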