
Extracting decimal numbers from string with Python regex

I tried this using Python's re library. From a file, I get several lines that contain elements separated by bars ('|'). I put them in a list, and I need to get the numbers inside in order to operate on them.

This would be one of the strings I want to split:

>>print(line_input)
>>[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]

My intention is to form a vector from each group of elements between square brackets.

I created this regular expression:

>>test_pattern="\|\[(\d*(\.\d+)?), (\d*(\.\d+)?), (\d*(\.\d+)?)]"

but the results are a bit confusing. In particular, the result is:

>>vectors = re.findall(test_pattern, line_input)

>>print(vectors)
>>[('240', '', '7821', '', '0', '', '12', '', '605', '', '0', '', '3', ''), ('1.5', '.5', '7881.25', '.25', '0', '', '0', '', '0', '', '0', '', '0', ''), ('23437', '', '76611', '', '0', '', '0', '', '0', '', '605', '', '605', '')]

I don't understand where the blank entries come from, nor why the decimal part gets duplicated. I know I'm almost there, and I'm sure it's a small, simple detail, but I can't see it.

Thank you very much in advance.

Those blanks are the empty optional decimal parts. Your vectors variable contains all capturing groups, whether empty or not. So when there is a decimal, you get one match for the outer group (\d*(\.\d+)?) and one for the inner group (\.\d+)?. Make the inner group non-capturing:

(\d+(?:\.\d+)?)

Note: I also changed it to require at least one digit before the decimal point (if any).
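To make the difference concrete, here is a minimal sketch that runs both groupings over a single bracketed list. The names sample, capturing and non_capturing are illustrative only and do not come from the original post.

```python
import re

sample = "[1.5, 7881.25, 0]"

# Nested capturing groups: findall returns one tuple per match, with an
# entry for the outer group and another for the inner (decimal) group.
# A group that did not take part in the match shows up as ''.
capturing = re.findall(r"(\d+(\.\d+)?)", sample)
print(capturing)

# Non-capturing inner group: only one group is left, so findall
# returns plain strings instead of tuples.
non_capturing = re.findall(r"(\d+(?:\.\d+)?)", sample)
print(non_capturing)
```

The first call shows the same duplicated decimal parts and empty strings as in the question, while the second yields one string per number.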

Another way to do this (potentially non-robust if the input format differs) is to split the string on ']|[' to get the lists, and then split on ', ' to get the values:

from decimal import Decimal
input_str = '[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]'

# ignore the first and last '[' and ']' chars, then split on list separators
list_strs = input_str[1:-1].split(']|[')

# Split on ', ' to get individual decimal values
decimal_lists = [[Decimal(i) for i in s.split(', ')] for s in list_strs]

# decimal_lists contains a list of lists of Decimal values, like the input format

for values in decimal_lists:
    print(', '.join(str(d) for d in values))

Result:

240, 7821, 0, 12, 605, 0, 3
1.5, 7881.25, 0, 543, 876, 0, 121
237, 761, 0, 61, 7, 605, 605
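If the spacing in the input is not perfectly regular, the same split-based idea can be made a little more tolerant. The following is only a sketch: the function name parse_vectors is hypothetical, and it assumes every bracketed list is non-empty and contains only plain decimal numbers.

```python
from decimal import Decimal

def parse_vectors(line):
    """Split a '[a, b]|[c, d]' style line into lists of Decimal values."""
    vectors = []
    for chunk in line.split('|'):
        # Drop surrounding whitespace, then the enclosing brackets
        inner = chunk.strip().strip('[]')
        # Split on bare commas and strip each item, so missing or
        # extra spaces after a comma do not matter
        vectors.append([Decimal(item.strip()) for item in inner.split(',')])
    return vectors

# Note the missing space after the first comma of the second list
line_input = '[240, 7821, 0]|[1.5,7881.25, 0]'
print(parse_vectors(line_input))
```

Splitting on '|' and ',' separately, rather than on the exact strings ']|[' and ', ', is what lets this version cope with small formatting differences.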

Regex has its place. However, grammars written with pyparsing are often easier to write and easier to read.

>>> import pyparsing as pp

The numbers are like words made out of digits and period/full stop characters. They are optionally followed by commas, which we can simply suppress.

>>> number = pp.Word(pp.nums+'.') + pp.Optional(',').suppress()

Each list consists of a left square bracket, which we suppress, followed by one or more numbers (as just defined), followed by a right square bracket, which we also suppress, followed by an optional bar character, again suppressed. (Incidentally, this bar is, to some degree, redundant because the right bracket closes the list.)

We apply Group to the entire construct so that pyparsing will organise the items we have not suppressed into separate Python lists for us.

>>> one_list = pp.Group(pp.Suppress('[') + pp.OneOrMore(number) + pp.Suppress(']') + pp.Suppress(pp.Optional('|')))

The whole collection of lists is just one or more lists.

>>> whole = pp.OneOrMore(one_list)

Here's the input,

>>> line_input = '[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]'

... which we parse into result r.

>>> r = whole.parseString(line_input)

We can display the resulting lists.

>>> r[0]
(['240', '7821', '0', '12', '605', '0', '3'], {})
>>> r[1]
(['1.5', '7881.25', '0', '543', '876', '0', '121'], {})
>>> r[2]
(['237', '761', '0', '61', '7', '605', '605'], {})

More likely, we would want the numbers as numbers. In this situation, we know that the strings in the lists represent either floats or integers.

>>> for l in r.asList():
...     [int(_) if _.isnumeric() else float(_) for _ in l]
... 
[240, 7821, 0, 12, 605, 0, 3]
[1.5, 7881.25, 0, 543, 876, 0, 121]
[237, 761, 0, 61, 7, 605, 605]

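You can try this:

import re
s = "[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]"
# One or more digits, optionally followed by a fractional part.
# The fractional part is optional and non-capturing, so single-digit
# values such as 0, 3 and 7 are matched and findall returns plain strings.
data = re.findall(r"\d+(?:\.\d+)?", s)

Output:

['240', '7821', '0', '12', '605', '0', '3', '1.5', '7881.25', '0', '543', '876', '0', '121', '237', '761', '0', '61', '7', '605', '605']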
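A single findall over the whole string returns one flat list, so the grouping into vectors is lost. If the values need to stay grouped per bracketed list, one possible sketch is to split on the bar first and then search each part. The pattern NUMBER and the helper to_number are assumptions made for this illustration, not part of the answer above.

```python
import re

s = "[240, 7821, 0, 12, 605, 0, 3]|[1.5, 7881.25, 0, 543, 876, 0, 121]|[237, 761, 0, 61, 7, 605, 605]"

# One or more digits, optionally followed by a fractional part
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def to_number(text):
    """Return an int when there is no decimal point, otherwise a float."""
    return float(text) if "." in text else int(text)

# One inner list per '|'-separated part of the input
vectors = [[to_number(m) for m in NUMBER.findall(part)] for part in s.split("|")]

for v in vectors:
    print(v)
```

This sketch only handles unsigned numbers in plain decimal notation; signs or exponents would need a broader pattern.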
