Extracting decimal numbers from string with Python regex

I tried this using Python's re library. From a file I get several lines that contain elements separated by bars ('|'). I put them in a list, and I need to extract the numbers inside so that I can operate on them.

This would be one of the strings I want to split:

>>print(line_input)
>>[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]

and my intention is to form a vector with each of the elements between square brackets.

I created this regular expression

>>test_pattern="\|\[(\d*(\.\d+)?), (\d*(\.\d+)?), (\d*(\.\d+)?)]"

but the results are a bit confusing. In particular, the result is

>>vectors = re.findall(test_pattern, line_input)

>>print(vectors)
>>[('240', '', '7821', '', '0', '', '12', '', '605', '', '0', '', '3', ''), ('1.5', '.5', '7881.25', '.25', '0', '', '0', '', '0', '', '0', '', '0', ''), ('23437', '', '76611', '', '0', '', '0', '', '0', '', '605', '', '605', '')]

I don't understand where the blanks come from, nor why the decimal part gets duplicated. I know I am almost there, and I'm sure it's a small, simple detail, but I don't see it.

Thank you very much in advance.

Those blanks are the optional decimal parts that did not match anything. Your vectors variable contains all capturing groups, whether empty or not. So when there is a decimal, you get one entry for the outer group (\d*(\.\d+)?) and one for the inner group (\.\d+)?. Make the inner group non-capturing:

(\d+(?:\.\d+)?)

Note: I also changed it to require a number before the decimal (if any).
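
To make the difference between the two kinds of group concrete, here is a minimal sketch using the question's line_input. The names capturing, non_capturing and vectors are illustrative only, and splitting on '|' before calling re.findall on each piece is just one assumed way to build the vectors, not necessarily what the asker had in mind.

```python
import re

line_input = "[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]"

# Nested capturing groups: findall returns one tuple per match,
# with an entry for every group, including groups that matched nothing.
capturing = re.findall(r"(\d+(\.\d+)?)", "0, 1.5")
print(capturing)  # [('0', ''), ('1.5', '.5')]

# Non-capturing inner group and no outer group: no capturing group is left,
# so findall returns the whole matched strings.
non_capturing = re.findall(r"\d+(?:\.\d+)?", "0, 1.5")
print(non_capturing)  # ['0', '1.5']

# Assumed approach: split on '|' and extract the numbers of each piece.
vectors = [
    [float(tok) if "." in tok else int(tok)
     for tok in re.findall(r"\d+(?:\.\d+)?", piece)]
    for piece in line_input.split("|")
]
print(vectors)
```

Using a raw string (the r prefix) keeps Python from interpreting the backslashes before the regex engine sees them.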

Another (potentially non-robust if the input format differs) way to do this would be to split the string on ']|[' to get the lists, and then split on ', ' to get the values:

from decimal import Decimal
input_str = '[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]'

# ignore the first and last '[' and ']' chars, then split on list separators
list_strs = input_str[1:-1].split(']|[')

# Split on ', ' to get individual decimal values
decimal_lists = [[Decimal(i) for i in s.split(', ')] for s in list_strs]

# decimal_lists contains a list of lists of Decimal values, like the input format

for values in decimal_lists:
    print(', '.join(str(d) for d in values))

Result:

240, 7821, 0, 12, 605, 0, 3
1.5, 7881.25, 0, 543, 876, 0, 121
237, 761, 0, 61, 7, 605, 605
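
If the spacing around the separators may vary, the same split-based idea can be made a little more tolerant. The following is only a sketch: parse_vectors is a hypothetical helper name, and it assumes the line still starts with '[' and ends with ']' once surrounding whitespace is stripped.

```python
import re
from decimal import Decimal

def parse_vectors(line):
    """Parse '[a, b]|[c, d]' into a list of lists of Decimal values."""
    # Drop surrounding whitespace and the outermost '[' and ']'
    inner = line.strip()[1:-1]
    # Split on ']|[' while allowing optional whitespace around the bar
    chunks = re.split(r"\]\s*\|\s*\[", inner)
    # Split each chunk on commas, allowing optional whitespace around them
    return [
        [Decimal(tok) for tok in re.split(r"\s*,\s*", chunk.strip())]
        for chunk in chunks
    ]

vectors = parse_vectors("[240, 7821,0] | [1.5 , 7881.25, 0]")
print(vectors)
```

Decimal raises decimal.InvalidOperation on a token that is not a number, so malformed input fails loudly instead of being skipped silently.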

Regex has its place. However, grammars written with pyparsing are often easier to write and easier to read.

>>> import pyparsing as pp

The numbers are like words made out of digits and period/full stop characters. They are optionally followed by commas which we can simply suppress.

>>> number = pp.Word(pp.nums+'.') + pp.Optional(',').suppress()

One of the lists consists of a left square bracket, which we suppress, followed by one or more numbers (as just defined), followed by a right square bracket, which we also suppress, followed by an optional bar character, again suppressed. (Incidentally, this bar is, to some degree, redundant because the right bracket closes the list.)

We apply Group to the entire construct so that pyparsing will organise the items we have not suppressed into separate Python lists for us.

>>> one_list = pp.Group(pp.Suppress('[') + pp.OneOrMore(number) + pp.Suppress(']') + pp.Suppress(pp.Optional('|')))

The whole collection of lists is just one or more lists.

>>> whole = pp.OneOrMore(one_list)

Here's the input,

>>> line_input = '[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]'

... which we parse into result r.

>>> r = whole.parseString(line_input)

We can display the resulting lists.

>>> r[0]
(['240', '7821', '0', '12', '605', '0', '3'], {})
>>> r[1]
(['1.5', '7881.25', '0', '543', '876', '0', '121'], {})
>>> r[2]
(['237', '761', '0', '61', '7', '605', '605'], {})

More likely, we would want the numbers as numbers. In this situation, we know that the strings in the lists represent either floats or integers.

>>> for l in r.asList():
...     [int(_) if _.isnumeric() else float(_) for _ in l]
... 
[240, 7821, 0, 12, 605, 0, 3]
[1.5, 7881.25, 0, 543, 876, 0, 121]
[237, 761, 0, 61, 7, 605, 605]
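
The int-or-float conversion does not depend on pyparsing itself. As a sketch, a small helper (to_number is a hypothetical name) can try int first and fall back to float, which avoids relying on str.isnumeric(). The parsed list below is a stand-in for part of what r.asList() returns in the session above.

```python
def to_number(text):
    """Return int(text) when possible, otherwise float(text)."""
    try:
        return int(text)
    except ValueError:
        return float(text)

# Stand-in for nested lists of numeric strings, as produced by the parser
parsed = [['240', '7821', '0'], ['1.5', '7881.25', '0', '543']]

numbers = [[to_number(tok) for tok in row] for row in parsed]
print(numbers)  # [[240, 7821, 0], [1.5, 7881.25, 0, 543]]
```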

You can try this:

import re
s = "[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]" 
data = re.findall(r"\d+(?:\.\d+)?", s)

Output:

['240', '7821', '0', '12', '605', '0', '3', '1.5', '7881.25', '0', '543', '876', '0', '121', '237', '761', '0', '61', '7', '605', '605']
