
Non-greedy regex in Python

Given the text:

'Adf adf asdf asdf asfdf https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc\n kdfja ladsjfladsjf ladksjf ladsjfl adsf https://<somepage>.com/djflkajdsfl\n\n djldjfld djfladjf ldfdjlkfj ldfj.'

How can I match any URL of the form https://<somepage>.com/subdir, where the match stops at the first space, newline, comma, or full stop?

Tried:

re.findall('http.*',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* ',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.* ?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* {1}?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.* +?',s) 
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf ']

re.findall('http.*[^ \n]',s)
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.*[^ \\n]',s)
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.*[^ \\\n]',s)
['https://<somepage>.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']

re.findall('http.* *?',s)
['https://imgur.com/abcabcabc kdfja ladsjfladsjf ladksjf ladsjfl adsfadf adf asdf asdf asfdf https://imgur.com/abcabcabc', 'https://somepage.com/djflkajdsfl']

Try the following:

re.findall('http[^ \n,]*',s)


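Since you're using . (which matches any character except a newline), neither lazy ( .*? ) nor greedy ( .* ) will work on its own. With nothing after it, the lazy form matches as little as it can and stops right after http, whereas the greedy form continues to the end of the line.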
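To see the difference concretely, here is a small sketch. The sample string and the example.com host are stand-ins assumed for illustration, not the asker's real data.

```python
import re

# Shortened sample text; example.com is an assumed placeholder host.
s = ("Adf adf https://example.com/abcabcabc kdfja adf "
     "https://example.com/abcabcabc\n kdfja adsf "
     "https://example.com/djflkajdsfl\n\n djldjfld ldfj.")

# Lazy: with nothing after .*?, it matches zero characters,
# so every result is just the literal prefix 'http'.
lazy = re.findall(r'http.*?', s)

# Greedy: .* runs to the end of the line (it does not cross '\n'),
# so two URLs on the same line are swallowed into one match.
greedy = re.findall(r'http.*', s)

print(lazy)
print(greedy)
```

Neither result is a list of clean URLs, which is why the pattern itself needs to say where a URL ends.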

Instead, specify which characters you do not want ( [^ \n,] ) and search on that. Since you want to stop at the first of those characters, a greedy quantifier on the negated class is exactly right: it consumes everything up to, but not including, the first excluded character.

Since the . character is legal inside URLs, it is hard to stop on it in general. But since you always want to include a subdirectory, you can exclude the . only after the last slash:

re.findall(r'http[^ \n,]*/[^ \n,.]*', s)

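As a rough check of the subdirectory pattern, the sketch below runs it against a made-up sentence where URLs are followed by a full stop, a comma, and a newline. The text and the example.com host are assumptions for illustration only.

```python
import re

# Hypothetical test sentence; example.com is an assumed placeholder host.
text = ("See https://example.com/abc. Also https://example.com/def, "
        "and https://example.com/ghi\nend.")

# Stops at space, newline or comma only: a trailing full stop is kept.
simple = re.findall(r'http[^ \n,]*', text)

# Additionally excludes '.' after the last slash of the URL.
pattern = re.compile(r'http[^ \n,]*/[^ \n,.]*')
with_subdir = pattern.findall(text)

# Caveat: a dot inside the last path segment also ends the match.
truncated = pattern.findall('https://example.com/file.html')

print(simple)
print(with_subdir)
print(truncated)
```

The last call shows the trade-off: excluding the full stop also cuts off file extensions, so this variant only suits URLs whose final segment contains no dot.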

The problem in your attempts isn't that the regexp matches too many spaces; it matches too many characters before a space. So don't put the non-greedy ? modifier after the space. Put it after the .* , because that is what currently matches too much.

py3.7 >>> re.findall('http.*? ', s)
['https://<somepage>.com/abcabcabc ']
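Note that this lazy form only finds URLs that are followed by a space, and it keeps that space in the result. One possible variation, which is not part of the original answer, is to replace the literal space with a lookahead so the terminator is checked but not consumed. The sample string and the example.com host below are assumed placeholders.

```python
import re

# Shortened sample text; example.com is an assumed placeholder host.
s = ("Adf adf https://example.com/abcabcabc kdfja adf "
     "https://example.com/abcabcabc\n kdfja adsf "
     "https://example.com/djflkajdsfl\n\n djldjfld ldfj.")

# Lazy match up to a literal space: URLs that end at a newline are missed,
# and the trailing space is part of the match.
with_space = re.findall(r'http.*? ', s)

# Lazy match up to (but not including) a space, newline, comma,
# or the end of the string, using a lookahead.
with_lookahead = re.findall(r'http.*?(?=[ \n,]|$)', s)

print(with_space)
print(with_lookahead)
```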

On the other hand, [^ \n] is not a modifier of any sort. It is a full match expression on its own, so putting it after an existing expression won't make that expression match less; you now have two match expressions which together match more.

You have to use it in place of the expression which matches too much, namely instead of the . :

py3.7 >>> re.findall('http[^ \n]*', s)
['https://<somepage>.com/abcabcabc', 'https://<somepage>.com/abcabcabc', 'https://<somepage>.com/djflkajdsfl']
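The difference between appending the character class and substituting it for the . can be checked side by side. This is a minimal sketch on an assumed sample string with a placeholder example.com host.

```python
import re

# Shortened sample text; example.com is an assumed placeholder host.
s = ("Adf adf https://example.com/abcabcabc kdfja adf "
     "https://example.com/abcabcabc\n kdfja adsf "
     "https://example.com/djflkajdsfl\n\n djldjfld ldfj.")

# Appended: .* still runs to the end of the line, and [^ \n] merely
# requires the final character to be neither a space nor a newline.
appended = re.findall(r'http.*[^ \n]', s)

# Substituted: the class replaces the dot, so each match ends
# at the first space or newline.
replaced = re.findall(r'http[^ \n]*', s)

print(len(appended), appended)
print(len(replaced), replaced)
```

The appended form returns two long matches, while the substituted form returns the three separate URLs.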
