
Python re.findall returning links with unwanted string afterwards

I have a Python script that uses BeautifulSoup to scrape a page. This is my code:

re.findall(r'stream:\/\/.+', link)

Which is designed to find links like:

stream://987cds9c8ujru56236te2ys28u99u2s

But it also returns strings like this:

stream://987cds9c8ujru56236te2ys28u99u2s  [SD] Spanish - (9.15am)

i.e. with spaces and extra text that I don't want. How can I write the re.findall call so that it returns only the link itself?

(Thanks in advance)

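The greedy .+ matches everything up to the end of the line, which is why the trailing text is included. You can use a non-greedy match instead (adding ? after the quantifier) followed by a word boundary, '\b':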

>>> re.findall(r'stream:\/\/.+?\b', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']

Or, if the link consists only of word characters, you can simply use '\w+':

>>> re.findall(r'stream:\/\/\w+', link)
['stream://987cds9c8ujru56236te2ys28u99u2s']
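
To compare the patterns side by side, here is a minimal sketch that runs them against the sample line from the question. The variable names are illustrative, and the last pattern, '\S+' (any run of non-whitespace characters), is not part of the answer above; it is included on the assumption that a link might contain characters such as '-' that '\w' does not match.

```python
import re

# Sample line copied from the question.
link = "stream://987cds9c8ujru56236te2ys28u99u2s  [SD] Spanish - (9.15am)"

# Greedy: .+ consumes everything up to the end of the line.
greedy = re.findall(r"stream://.+", link)

# Non-greedy plus word boundary: stops at the first word boundary,
# which here is the end of the run of word characters.
lazy = re.findall(r"stream://.+?\b", link)

# Word characters only (letters, digits, underscore).
word_chars = re.findall(r"stream://\w+", link)

# Assumption, not from the answer: stop at the first whitespace instead.
non_space = re.findall(r"stream://\S+", link)

print(greedy)
print(lazy)
print(word_chars)
print(non_space)
```

On this input the last three patterns all return only the link, while the greedy one returns the whole line. The forward slashes do not need escaping in a Python regular expression, so 'stream://' and 'stream:\/\/' behave the same inside a raw string.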
