
How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = "http://bassrx.tumblr.com/tagged/tt"    # nsfw link
page = urlopen(url)
html = page.read()    # get the html from the url

# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)

print image_links

The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg

The only downside is it is very slow.

BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.

The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want.

<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
    <img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>

So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?

I just want a list of matching URLs. I've been trying to use the soup.findall() function, but cannot get any useful results.

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())

for element in soup.findAll('img'):
    print(element.get('src'))
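The snippet above imports urllib2, which exists only in Python 2; in Python 3 the same function lives in urllib.request. Here is a minimal Python 3 sketch of the same idea. The helper names extract_img_srcs and fetch_img_srcs are made up for illustration, and the "html.parser" backend is an assumption chosen because it ships with Python.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen

from bs4 import BeautifulSoup


def extract_img_srcs(html):
    """Return the src attribute of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    # src=True keeps only <img> tags that actually carry a src attribute.
    return [img["src"] for img in soup.find_all("img", src=True)]


def fetch_img_srcs(url):
    """Download a page and return its image URLs (needs network access)."""
    with urlopen(url) as page:
        return extract_img_srcs(page.read())


# Offline check against the HTML snippet from the question.
sample = """
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
"""
print(extract_img_srcs(sample))
```

Note that this returns every image on the page, not only the ones inside the post bodies, so some filtering is usually still needed.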

You can use the div.media > a > img CSS selector to find img tags that are direct children of an a tag, which is in turn a direct child of a div tag with the media class:

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
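To see what the selector filters out, here is a small offline sketch that runs it on the snippet from the question plus one extra image. The sidebar div and the example.com avatar URL are invented for illustration, and "html.parser" is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = """
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
<div class="sidebar"><img src="http://example.com/avatar.png"/></div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" is the child combinator: the img must be a direct child of an <a>,
# which must itself be a direct child of <div class="media">.
post_images = [img.get("src") for img in soup.select("div.media > a > img")]
print(post_images)
```

Only the image inside div.media is returned; the sidebar image is skipped because its parent chain does not match.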

To make the parsing faster, you can use the lxml parser:

soup = BeautifulSoup(urlopen(url), "lxml")

You need to install the lxml module first, of course.

Also, you can use the SoupStrainer class to parse only the relevant parts of the document.
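As a rough sketch of the SoupStrainer idea, the example below passes a strainer through the parse_only argument so that only img tags end up in the tree. The sample HTML (including the sidebar image) is invented for illustration, and "html.parser" is an assumed parser choice; lxml also supports parse_only.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = """
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
<div class="sidebar"><img src="http://example.com/avatar.png"/></div>
"""

# Keep only <img> tags while parsing; everything else is discarded.
only_imgs = SoupStrainer("img")
soup = BeautifulSoup(html, "html.parser", parse_only=only_imgs)

srcs = [img["src"] for img in soup.find_all("img", src=True)]
print(srcs)
```

One trade-off to keep in mind: because the surrounding div and a tags are never built, a selector such as div.media > a > img will find nothing in a tree strained this way, so the filtering has to happen on the img attributes instead.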

Hope that helps.

Have a look at combining BeautifulSoup.find_all with re.compile:

from urllib2 import urlopen
import re
from bs4 import BeautifulSoup

url = "http://bassrx.tumblr.com/tagged/tt"    # nsfw link
page = urlopen(url)
html = page.read()    
bs = BeautifulSoup(html)
# find_all accepts a compiled regex as an attribute filter; here it is applied to href
a_tumblr = bs.find_all(href=re.compile(r"media\.tumblr"))
# Filtering on href returns the <link> tags below, not the <img> tags:
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
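Since the question asks for img src values rather than href values, the same regex technique can be pointed at the src attribute instead. This is a sketch on made-up sample HTML, with a pattern loosely modelled on the one in the question; the exact pattern and the "html.parser" choice are assumptions.

```python
import re

from bs4 import BeautifulSoup

html = """
<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>
<img src="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png"/>
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match only <img> tags whose src is on media.tumblr, contains "tumblr_",
# and ends in ".jpg". Beautiful Soup applies the pattern with search().
pattern = re.compile(r"media\.tumblr\S*tumblr_\S*\.jpg$")
image_links = [img["src"] for img in soup.find_all("img", src=pattern)]
print(image_links)
```

The link tag is ignored because only img tags are searched, and the avatar image is dropped because its src does not match the pattern.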

