Regex: how to match more than one digit in a phrase

Question

I have a dataset of strings that each contain a phrase followed by a temperature:

Example: string = "New York is humid with 15.43C" and only_temp = "15.43C".

My code:

When the string is just only_temp = "15.43C",

re.search('\d+.\d\dC', string) finds it.

When the string is the full sentence "New York is humid with 15.43C",

re.search("(.*)(\d+.\d\dC)", string) returns two groups: "New York is humid with 1" and "5.43C" (instead of "15.43C").

I believe the problem is in the .* part, but I cannot find a solution.

Answer 1

In

re.search("(.*)(\d+.\d\dC)",string)

the '(.*)' greedily captures as much as it can. The following '(\d+.\d\dC)' is greedy as well, but it only requires one or more digits before the dot, so the regex engine backtracks just far enough for a single digit to satisfy it. That is why the first group keeps the 1.
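The backtracking can be seen by printing both groups. This is a small sketch using the string from the question; the variable name greedy is only illustrative.

```python
import re

string = "New York is humid with 15.43C"

# Greedy (.*) first grabs the whole string, then gives characters back one
# at a time until the rest of the pattern can match. It stops as soon as
# \d+ is satisfied by a single digit, so the "1" stays in group 1.
greedy = re.search(r"(.*)(\d+.\d\dC)", string)
print(greedy.groups())  # ('New York is humid with 1', '5.43C')
```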

Make it non-greedy:

re.search("(.*?)(\d+.\d\dC)",string)

so the first group is lazy and captures only as little as it has to. The second group then captures the full temperature. You may as well make the first group non-capturing if you do not need it at all:

re.search("(?:.*?)(\d+.\d\dC)",string)

Demo:

import re

string = "New York is humid with 15.43C"
only_temp = "15.43C"

# Raw string so that \d reaches the regex engine unchanged
s = re.search(r"(?:.*?)(\d+.\d\dC)", string)
print(s.groups())

Output:

('15.43C',)
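Because re.search scans the string for the first position where the pattern matches, the leading group can also be dropped entirely when only the temperature is needed. A minimal sketch with the string from the question (the variable name m is just for illustration):

```python
import re

string = "New York is humid with 15.43C"

# re.search tries the pattern at each position from left to right, so the
# first successful match starts at the "1" of "15.43C".
m = re.search(r"\d+.\d\dC", string)
print(m.group())  # 15.43C
```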

Answer 2

I think you could simplify your regex:

import re

s = 'New York is humid with 15.43C'

re.search(r"([\d\.]+C$)", s).groups()

OUTPUT

('15.43C',)

While it is true that [\d\.]+ is more permissive than \d+.\d\d, I think it is reasonably safe to assume that a run of digits and dots before the C expresses the temperature.

For example, if your sentence is:

s = 'New York is humid with 16C'

a more restrictive pattern won't return any match.
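To make the trade-off concrete, here is a rough comparison of the two styles. The names loose and strict and the "build 1.2.3C" string are made up for illustration. Inside a character class the dot is already literal, so [\d.] behaves the same as [\d\.].

```python
import re

loose = r"[\d.]+C$"     # any run of digits and dots, then C at the end
strict = r"\d+\.\d\dC"  # digits, a literal dot, exactly two digits, C

s = "New York is humid with 16C"
print(re.search(loose, s).group())  # 16C
print(re.search(strict, s))         # None

# The loose class also accepts runs that are not valid numbers.
print(re.search(loose, "build 1.2.3C").group())  # 1.2.3C
```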

In any case, note that the dot needs to be escaped, since in a regex . means any character. Otherwise:

s = "New York is humid with 15A43C"
re.search(r"(?:.*?)(\d+.\d\dC)", s).groups()

will return a match

OUTPUT

('15A43C',)

I do understand that it is reasonably safe to assume that \d+.\d\dC will generally match a Celsius temperature. I am just saying that you are not matching a literal dot, if that is the intention.
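For comparison, a sketch of the same check with the dot escaped (the variable name pattern is only illustrative):

```python
import re

pattern = r"\d+\.\d\dC"  # \. matches only a literal period

# "15A43C" no longer matches because "A" is not a period.
print(re.search(pattern, "New York is humid with 15A43C"))          # None
print(re.search(pattern, "New York is humid with 15.43C").group())  # 15.43C
```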

Answer 3

There were two mistakes in your regex:

. matches any character, not just a period; use \. to match the decimal separator
the first group should be non-greedy so that it consumes only the minimum needed for the second group to match

import re

s = 'New York is humid with 15.43C'
# Lazy first group, optional whitespace, then the temperature
m = re.search(r'(.*?)\s*(\d+\.\d{,2}C)', s)
m.groups()
# ('New York is humid with', '15.43C')

If you want to handle the case where there is no decimal part:

s = 'New York is humid with 15C'
# The decimal part is wrapped in an optional non-capturing group
m = re.search(r'(.*?)\s*(\d+(?:\.\d{,2})?C)', s)
m.groups()
# ('New York is humid with', '15C')
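If the goal is to pull the number out of every string in the dataset, the optional-decimal pattern could be wrapped in a small helper. This is only a sketch: extract_temp, TEMP_RE and rows are hypothetical names, not part of the answers above. It also assumes \d{1,2} rather than \d{,2}, because {,2} allows zero digits after the dot and would accept "15.C". Negative temperatures are not handled.

```python
import re

# Hypothetical helper. {1,2} requires at least one digit after the dot.
TEMP_RE = re.compile(r"(\d+(?:\.\d{1,2})?)C")

def extract_temp(text):
    """Return the first Celsius reading in text as a float, or None."""
    m = TEMP_RE.search(text)
    return float(m.group(1)) if m else None

rows = [
    "New York is humid with 15.43C",
    "New York is humid with 15C",
    "No reading today",
]
print([extract_temp(r) for r in rows])  # [15.43, 15.0, None]
```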
