While scraping text from a website, I ran into a problem with a string x like this:
x = 'P. 4. (5.2x4) abc 5.2x4 aoohrwow'
I want to create a regex pattern that can extract the 5.2x4.
Because users do not follow a single format when they type dimensions, I wrote a pattern that covers almost every case (strings y, z and v, for example) except the x string above.
y = 'abc 5.2 x 4 nsdf'
z = 'abc (5.2)3x 4ohsdf'
v = 'abc 5.2(4.) x4. qoqwh'
With my pattern, what I get from the x string is 4.(5.2x4), but I need the 5.2x4 part.
My pattern so far:
p = r'[(]?(\s)?(\d+)?(\.)?(\d+)?(\s)?[)]?(\s)?[(]?(\s)?\d+(\.)?(\d+)?(\s)?[)]?' \
r'(\s)?x(\s)?' \
r'[(]?(\s)?(\d+)?(\.)?(\d+)?(\s)?[)]?(\s)?[(]?(\s)?\d+(\.)?(\d+)?(\s)?[)]?'
Can anyone help me on this? Thank you for your time.
Edit: In general, what I need to extract from a string has a pattern like this:
(1.2)(3.4)x(5.6)
The strings I scrape from the website can be missing some parts of this. In the x string above, my code miscounts the 4. in P. 4. as one of the dimension parts, but in fact it is not.
Can I search for a pattern in the string from end to start? If so, I could solve this problem.
Since your examples vary so much from one to the other, and I'm assuming you want exactly 5.2x4 from each one, here is a snippet that works for the examples you gave. Let me know if there are inputs for which it fails:
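import re

examples = ["P. 4. (5.2x4) abc 5.2x4 aoohrwow",
            "abc 5.2 x 4 nsdf", "abc (5.2)3x 4ohsdf",
            "abc 5.2(4.) x4. qoqwh"]

for x in examples:
    # Remove spaces so that "5.2 x 4" and "5.2x4" look the same.
    stripped = x.replace(' ', '')
    # Group 1 is the first decimal number; group 2 is "x" plus the digits after it.
    # Anything between the two groups that is not an "x" is skipped.
    matches = [''.join(match) for match in re.findall(r"(\d+\.\d+)[^x]*?(x\d+)", stripped)]
    print(matches)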
"""
output:
['5.2x4', '5.2x4']
['5.2x4']
['5.2x4']
['5.2x4']
"""