
Get only numbers at the end (regex)

I'd like to get only the numbers (integers) at the end of the phrases below:

VISTA AES TIETE E UNT N2 600 
VISTA IT AUUNIBANCO PN N1 1.400
OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000

I mean: 600, 1400, 100000. I'll add each one of them to a database later.

I tried to use regex: (?<=\s)(\d*\s*)|(\d*.\d*)$

But it didn't work properly. Any ideas?

PS: We use dots, not commas, to represent thousands: 1.000 instead of 1,000.

Actually, for your use case, I don't think you even need regex.

You can just split each string on whitespace, take the last token, and replace the dots with an empty string.

If it's a DataFrame (since you have tagged pandas):

> df['colName'].str.split().str[-1].str.replace('.', '', regex=False)
0       600
1      1400
2    100000
Name: colName, dtype: object

If it's a list of strings:

> list(map(lambda x: x.replace('.', ''),map(lambda x: x.split()[-1], data)))
['600', '1400', '100000']
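
Since the values are meant for a database, the split-and-replace idea above can also be sketched in plain Python with a conversion to int. The helper name last_integer is hypothetical, and the sketch assumes every phrase ends in a whitespace-separated number that uses dots only as thousands separators.

```python
def last_integer(phrase: str) -> int:
    """Return the last whitespace-separated token as an int.

    Assumption: dots in that token are thousands separators only.
    """
    token = phrase.split()[-1]           # split() also ignores trailing spaces
    return int(token.replace(".", ""))   # "1.400" -> "1400" -> 1400


data = [
    "VISTA AES TIETE E UNT N2 600 ",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000",
]

numbers = [last_integer(phrase) for phrase in data]
print(numbers)  # [600, 1400, 100000]
```

An empty phrase raises IndexError and a non-numeric last token raises ValueError, so real data may need a guard around the call.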
l = ["VISTA AES TIETE E UNT N2 600",
"VISTA IT AUUNIBANCO PN N1 1.400",
"OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000"]

If the data is in the form of a DataFrame:

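from pandas import DataFrame

df = DataFrame({'col': l})
# Capture the trailing run of digits and dots, then strip the thousands dots.
df.col.str.extract(r'(\d*\.*\d*)?$').astype(str).replace(r'\.', '', regex=True)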

Output

0   600
1   1400
2   100000

In the pattern that you tried, the part (?<=\s)(\d*\s*) matches optional digits followed by optional whitespace chars, with a whitespace char required directly to the left.

Because both the digits and the whitespace in the match are optional, it also matches the empty string at every position in the string that has a whitespace char to its left.

In the part (\d*\.\d*)$ the digits are optional, so it could also match just a dot at the end of the string.
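
Both problems can be checked with Python's re module. This is a small sketch that assumes the second alternative uses an escaped dot, as quoted above; the variable names are illustrative.

```python
import re

# The attempted pattern, with the dot escaped as in the explanation above.
attempt = re.compile(r"(?<=\s)(\d*\s*)|(\d*\.\d*)$")
text = "VISTA IT AUUNIBANCO PN N1 1.400"

# The first hit is an empty match right after the first space, not the number.
first = attempt.search(text)
print(repr(first.group(0)), first.span())  # '' (6, 6)

# With optional digits on both sides, a lone trailing dot is also accepted.
dot_only = re.search(r"(\d*\.\d*)$", "ends with a dot.")
print(dot_only.group(0))  # .
```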


If there has to be a whitespace char before the number at the end, you can use:

(?<=\s)\d{1,3}(?:\.\d{3})*$

The pattern matches:

  • (?<=\s) Positive lookbehind, assert a whitespace char directly to the left of the current position
  • \d{1,3} Match 1-3 digits
  • (?:\.\d{3})* Optionally repeat a dot and 3 digits
  • $ End of string

See a regex demo.

If the number can also be by itself, you could assert a whitespace boundary to the left with (?<!\S):

(?<!\S)\d{1,3}(?:\.\d{3})*$

See another regex demo.
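
The difference between the two lookbehinds can be sketched with the standard re module. The helper to_int and the short test strings are made up for illustration.

```python
import re

needs_space = re.compile(r"(?<=\s)\d{1,3}(?:\.\d{3})*$")  # whitespace required before
boundary = re.compile(r"(?<!\S)\d{1,3}(?:\.\d{3})*$")     # whitespace or start of string


def to_int(pattern, text):
    """Return the trailing number as an int, or None if the pattern does not match."""
    match = pattern.search(text)
    if match is None:
        return None
    return int(match.group(0).replace(".", ""))  # drop thousands dots


print(to_int(needs_space, "PN N1 1.400"))  # 1400
print(to_int(needs_space, "1.400"))        # None, nothing precedes the number
print(to_int(boundary, "1.400"))           # 1400
print(to_int(boundary, "COGNP450"))        # None, digits are glued to letters
```

Because $ only matches at the very end of the string (or before a final newline), a phrase with trailing spaces would need text.rstrip() before searching.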

For example, using str.extract and wrapping the pattern in a capture group:

import pandas as pd

strings = [
    "VISTA AES TIETE E UNT N2 600",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000"
]

df = pd.DataFrame(strings, columns=["colName"])
df['lastNumbers'] = df['colName'].str.extract(r"(?<=\s)(\d{1,3}(?:\.\d{3})*)$")

print(df)

Output

                                             colName lastNumbers
0                       VISTA AES TIETE E UNT N2 600         600
1                    VISTA IT AUUNIBANCO PN N1 1.400       1.400
2  OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000     100.000
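
The extracted values above are still strings that contain the thousands dots. One possible follow-up, assuming every row matches the pattern, is to strip the dots and cast to int before writing to the database. The names extracted and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"colName": [
    "VISTA AES TIETE E UNT N2 600",
    "VISTA IT AUUNIBANCO PN N1 1.400",
    "OPCAO DE VENDA 04/21 COGNP450ON 4,50COGNE 100.000",
]})

# expand=False returns a Series because the pattern has a single capture group.
extracted = df["colName"].str.extract(r"(?<!\S)(\d{1,3}(?:\.\d{3})*)$", expand=False)

# regex=False treats "." as a literal dot; astype(int) assumes there are no missing values.
df["lastNumbers"] = extracted.str.replace(".", "", regex=False).astype(int)

values = df["lastNumbers"].tolist()
print(values)  # [600, 1400, 100000]
```

If some rows might not end in a number, the extract step yields missing values there and astype(int) fails, so those rows would have to be dropped or handled first.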
