How to search for a multi-digit number in a phrase with a regular expression

Question

I have a dataset of strings that contain phrases plus temperatures:

Example: string = "New York is humid with 15.43C", only_temp = "15.43C"

My code: --> If only_temp = "15.43C", then

re.search('\d+.\d\dC', string) finds it.

--> If string = "New York is humid with 15.43C", then

re.search("(.*)(\d+.\d\dC)", string) finds 2 groups: "New York is humid with" and "5.43C" (instead of 15.43C).

I believe that the problem is in .* but I cannot find a solution.

Answer 1

In

re.search("(.*)(\d+.\d\dC)",string)

the '(.*)' greedily captures as much as it can. The following '(\d+.\d\dC)' is greedy as well, but it only requires one or more digits before the dot, so a single digit is enough to satisfy it. That is why the first group captures the 1.

Make it non-greedy:

re.search("(.*?)(\d+.\d\dC)",string)

so the first group is lazy and captures only as little as it has to. The second group then captures the full temperature. You may as well make the first group non-capturing if you do not need it at all:

re.search("(?:.*?)(\d+.\d\dC)",string)

Demo:

import re
string = "New York is humid with 15.43C"
only_temp ="15.43C"

s =  re.search("(?:.*?)(\d+.\d\dC)", string)
print(s.groups())

Output:

('15.43C',)
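The difference between the greedy and the lazy first group can be checked side by side. The following is a minimal sketch that reuses the two patterns from this answer; the variable names greedy and lazy are only illustrative.

```python
import re

string = "New York is humid with 15.43C"

# Greedy: (.*) takes as much as it can and gives back only what the
# second group strictly needs, so \d+ is left with a single digit.
greedy = re.search(r"(.*)(\d+.\d\dC)", string)
print(greedy.groups())  # ('New York is humid with 1', '5.43C')

# Lazy: (.*?) takes as little as it can, so \d+ gets both digits.
lazy = re.search(r"(.*?)(\d+.\d\dC)", string)
print(lazy.groups())  # ('New York is humid with ', '15.43C')
```

Note that the lazy first group still includes the trailing space before the number.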

Answer 2

I think you could simplify your regex:

import re

s = 'New York is humid with 15.43C'

re.search("([\d\.]+C$)", s).groups()

Output:

('15.43C',)

While it is true that [\d\.]+ is more free-form than \d+.\d\d, I think it is reasonably safe to assume that a run of digits and dots right before the final C expresses the temperature.

For example, if your sentence is like:

s = 'New York is humid with 16C'

a more restrictive pattern won't return any match.
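A quick check of that claim, as a sketch: the strict pattern below is assumed to be the question's pattern with the dot escaped, and the names strict and loose are illustrative.

```python
import re

s = 'New York is humid with 16C'

# Requires digits, a literal dot and exactly two decimals before C.
strict = re.search(r"\d+\.\d\dC", s)

# Accepts any run of digits and dots before a C at the end of the string.
loose = re.search(r"([\d\.]+C$)", s)

print(strict)          # None
print(loose.groups())  # ('16C',)
```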

In any case, note that the dot needs to be escaped, given that in regex . means any character. Otherwise:

s = "New York is humid with 15A43C"
re.search("(?:.*?)(\d+.\d\dC)", s).groups()

will return a match

Output:

('15A43C',)

I do understand that it is reasonably safe to assume that \d+.\d\dC will generally match a Celsius temperature. I am just saying that you are not matching a dot, if that is the intention.
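To see the effect of escaping, the same string can be run against both forms of the pattern. This is a small sketch; unescaped and escaped are illustrative names.

```python
import re

s = "New York is humid with 15A43C"

# Unescaped: . matches any character, including the letter A.
unescaped = re.search(r"(\d+.\d\dC)", s)

# Escaped: \. matches only a literal dot, so this string has no match.
escaped = re.search(r"(\d+\.\d\dC)", s)

print(unescaped.group(1))  # 15A43C
print(escaped)             # None
```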

Answer 3

There were several mistakes in your regex:

. will match any character, not just a period; use \. to match the decimal separator
make your first group non-greedy so that it consumes only the minimum needed for the second group to match

import re

s = 'New York is humid with 15.43C'
# lazy first group, escaped dot, up to two decimal digits
m = re.search(r'(.*?)\s*(\d+\.\d{,2}C)', s)
m.groups()
# ('New York is humid with', '15.43C')

If you want to handle the case where there is no decimal part:

s = 'New York is humid with 15C'
m = re.search('(.*?)\s*(\d+(?:\.\d{,2})?C)', s)
m.groups()
# ('New York is humid with', '15C')
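If the goal is to get the numeric value rather than the matched text, the pattern with the optional decimal part could be wrapped in a small helper. This is only a sketch: extract_celsius and TEMP_RE are hypothetical names, not part of the question, and the capture group is moved so that it covers the number without the trailing C.

```python
import re
from typing import Optional

# Hypothetical helper built on the pattern from this answer; the group
# captures only the numeric part, the C stays outside of it.
TEMP_RE = re.compile(r'(\d+(?:\.\d{,2})?)C')

def extract_celsius(text: str) -> Optional[float]:
    """Return the first Celsius temperature in text, or None if absent."""
    m = TEMP_RE.search(text)
    return float(m.group(1)) if m else None

print(extract_celsius('New York is humid with 15.43C'))  # 15.43
print(extract_celsius('New York is humid with 15C'))     # 15.0
print(extract_celsius('No temperature here'))            # None
```

Checking for None before calling group() avoids the AttributeError that m.groups() raises when a string contains no temperature.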
