I have a dataset with strings that include phrases + temperatures:
ex. string = "New York is humid with 15.43C". only_temp="15.43C"
My code: if only_temp = "15.43C", then
re.search('\d+.\d\dC', string)
finds it.
If string = "New York is humid with 15.43C", then
re.search("(.*)(\d+.\d\dC)", string)
finds 2 groups: "New York is humid with 1" and "5.43C" (instead of "15.43C").
I believe that the problem is in .* but I cannot find a solution.
In
re.search("(.*)(\d+.\d\dC)",string)
the '(.*)' greedily captures anything. The following '(\d+.\d\dC)' captures greedily as well, but it only requires one or more digits before the dot, so a single digit is enough for it to match. That is why the first expression also captures the 1 and leaves only "5.43C" for the second group.
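To see the backtracking described above, here is a minimal sketch that just prints both groups for the greedy pattern from the question. The raw-string prefix (r"...") and the variable name greedy are my additions; the pattern itself is unchanged.

```python
import re

string = "New York is humid with 15.43C"

# Greedy (.*) first swallows the whole string, then gives characters back
# one at a time until (\d+.\d\dC) can match. One digit satisfies \d+,
# so the backtracking stops at "5.43C".
greedy = re.search(r"(.*)(\d+.\d\dC)", string)
print(greedy.groups())
# ('New York is humid with 1', '5.43C')
```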
Make it non-greedy:
re.search("(.*?)(\d+.\d\dC)",string)
so the first expression is lazy and captures only as little as it has to. The following group will then capture the full temperature in degrees Celsius. You may as well make the first group non-capturing if you do not need it at all:
re.search("(?:.*?)(\d+.\d\dC)",string)
Demo:
import re
string = "New York is humid with 15.43C"
only_temp ="15.43C"
s = re.search("(?:.*?)(\d+.\d\dC)", string)
print(s.groups())
Output:
('15.43C',)
I think you could simplify your regex:
import re
s = 'New York is humid with 15.43C'
re.search("([\d\.]+C$)", s).groups()
OUTPUT
('15.43C',)
While it is true that [\d\.]+ is more free-form than \d+.\d\d, I think it is reasonably safe to assume that a combination of digits and dots expresses the temperature.
For example if your sentence is like:
s = 'New York is humid with 16C'
a more restrictive pattern won't return any match.
In any case, note that the dot needs to be escaped, given that in regex . means any character. Otherwise:
s = "New York is humid with 15A43C"
re.search("(?:.*?)(\d+.\d\dC)", s).groups()
will return a match
OUTPUT
('15A43C',)
I do understand that it is reasonably safe to assume that \d+.\d\dC will generally match a Celsius temperature. I am just saying that you are not matching a literal dot, if that is the intention.
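For comparison, a small sketch of the same check with the dot escaped. The name pattern and the two test strings are only for illustration; the strings are the ones already used in this thread.

```python
import re

# \. matches a literal period only, so "15A43C" no longer qualifies.
pattern = re.compile(r"(\d+\.\d\dC)")

no_match = pattern.search("New York is humid with 15A43C")
match = pattern.search("New York is humid with 15.43C")

print(no_match)        # None
print(match.group(1))  # 15.43C
```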
There were several mistakes in your regex:
. will match any character, not a period; use \. to match the decimal separator.
s = 'New York is humid with 15.43C'
m = re.search('(.*?)\s*(\d+\.\d{,2}C)', s)
m.groups()
# ('New York is humid with', '15.43C')
If you want to handle the case where there is no decimal part:
s = 'New York is humid with 15C'
m = re.search('(.*?)\s*(\d+(?:\.\d{,2})?C)', s)
m.groups()
# ('New York is humid with', '15C')
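If the goal is to end up with a number rather than a string, the optional-decimal pattern can be wrapped in a small helper. This is only a sketch under a few assumptions: extract_temperature and TEMP_RE are hypothetical names, I use \d{1,2} instead of \d{,2} so that a bare "15.C" is not accepted as a decimal, and negative temperatures are not handled.

```python
import re

# Digits, then an optional ".d" or ".dd" part, then a literal C.
# The number is captured without the trailing unit.
TEMP_RE = re.compile(r"(\d+(?:\.\d{1,2})?)C")

def extract_temperature(text):
    """Return the first Celsius temperature in text as a float, or None."""
    m = TEMP_RE.search(text)
    return float(m.group(1)) if m else None

print(extract_temperature("New York is humid with 15.43C"))  # 15.43
print(extract_temperature("New York is humid with 15C"))     # 15.0
print(extract_temperature("New York is humid"))              # None
```

Returning None for "no match" keeps the caller from having to catch an AttributeError on a failed search.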