
Regular expression patterns

I ran into this problem while scraping text from a website. I have a string x like this:

x = 'P. 4. (5.2x4) abc 5.2x4 aoohrwow'

I want to create a regex pattern that extracts 5.2x4. Because users do not follow a single format when they type dimensions, I came up with a pattern that covers almost every case (strings y, z and v below, for example) except the string x above.

y = 'abc 5.2 x 4 nsdf'

z = 'abc (5.2)3x 4ohsdf'

v = 'abc 5.2(4.) x4. qoqwh'

With my pattern, what I get from string x is 4.(5.2x4), but I need the 5.2x4 part.

My pattern so far:

p = r'[(]?(\s)?(\d+)?(\.)?(\d+)?(\s)?[)]?(\s)?[(]?(\s)?\d+(\.)?(\d+)?(\s)?[)]?' \
    r'(\s)?x(\s)?' \
    r'[(]?(\s)?(\d+)?(\.)?(\d+)?(\s)?[)]?(\s)?[(]?(\s)?\d+(\.)?(\d+)?(\s)?[)]?'

Can anyone help me with this? Thank you for your time.

Edit: In general, what I need to extract from a string follows a pattern like this:

(1.2)(3.4)x(5.6)

The strings I scrape from the website can be missing some parts of this. In string x above, my code miscounts the 4. in P. 4. as one of the dimension parts, but in fact it is not.

Can I search for a pattern in the string from end to start? If so, I can solve this problem.

Since your examples vary so much from one to the other, and I'm assuming you want exactly 5.2x4 from each one, here is a snippet that works for the examples you gave. Let me know if you find inputs where it fails:

import re


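examples = [
    "P. 4. (5.2x4) abc 5.2x4 aoohrwow",
    "abc 5.2 x 4 nsdf",
    "abc (5.2)3x 4ohsdf",
    "abc 5.2(4.) x4. qoqwh",
]

for x in examples:
    # Drop spaces so "5.2 x 4" and "5.2x4" look the same to the regex.
    stripped = x.replace(' ', '')
    # Capture a decimal number, lazily skip anything that is not an 'x',
    # then capture the 'x' and the integer after it; join the two groups.
    matches = [''.join(groups)
               for groups in re.findall(r"(\d+\.\d+)[^x]*?(x\d+)", stripped)]

    print(matches)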

"""
output:
['5.2x4', '5.2x4']
['5.2x4']
['5.2x4']
['5.2x4']
"""
