
Why is non-greedy Python Regex not non-greedy enough?

I've applied a non-greedy regex to a group of URL strings, trying to clean them up so that they end after the .com (.co.uk, etc.). Some of them continue with ' or " or < after the desired cutoff, so I used x = re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', finalSoup2) .

The problem is that some of the strings look like misc@misc.misc'misc''misc' (or similar with < >), so even after switching to the non-greedy regex I'm still left with results like enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au , for example.

I've tried doubling up the ?? , but that obviously isn't working, so what's the proper way to achieve clean URLs in this situation?

The issue with your regex is that \S matches any non-whitespace character — including ' , " and < — so the greedy \S+ before .co can run straight past the first delimiter, and the lazy \S*? then keeps expanding until some later delimiter is found. Instead of matching "any non-space", you should match "anything except the delimiter characters".
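To see the failure concretely, here is a small sketch (the sample string is taken from the question) running the question's original pattern — note how the match glues two addresses together because \S happily matches the quote characters:

```python
import re

# \S matches ' " < > too, so the greedy \S+ before ".co" runs past the
# first quote, and the lazy \S*? then expands until the NEXT delimiter,
# gluing two addresses together.
s = 'enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au"'
print(re.findall('([A-Za-z0-9]+@\S+.co\S*?)[\'"<]', s))
# ['enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au']
```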

So in this case, based on the information above, you could get away with the following regex:

>>> import re
>>> finalSoup2 = """
... misc@misc.misc'misc''misc
... enquiries@smart-traffic.com.au">enquiries@smart-traffic.com.au
... google.com
... google.co.uk"'<>Stuff
... """
>>> x = re.findall('([A-Za-z0-9]+@[^\'"<>]+)[\'"<]', finalSoup2)
>>> x
['misc@misc.misc',
 'enquiries@smart-traffic.com.au',
 'enquiries@smart-traffic.com.au\ngoogle.com\ngoogle.co.uk']

You can then use this to get the URLs you'd like, but make sure to split each result on '\n' , since a single match may contain newline characters, as seen above.

