How to extract specific URLs from HTML using Beautiful Soup?
I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt"  # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall(r"src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else:
http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is that it is very slow. BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it. The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want:
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is: how can I get BeautifulSoup to extract all of those img src URLs cleanly, without any other cruft? I just want a list of matching URLs. I've been trying to use the soup.findAll() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags inside an a tag, which is inside a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
In order to make the parsing faster you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install the lxml module first, of course.
Also, you can make use of the SoupStrainer class to parse only the relevant part of the document.
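For example, a minimal SoupStrainer sketch (parsing the HTML snippet from the question inline, so no network access is needed; the html.parser backend is assumed here, but lxml works the same way):

```python
from bs4 import BeautifulSoup, SoupStrainer

# The snippet from the question, inlined so the example is self-contained
html = '''<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>'''

# Build a tree containing only <img> tags; everything else in the
# document is skipped during parsing, which saves time on large pages
only_imgs = SoupStrainer('img')
soup = BeautifulSoup(html, 'html.parser', parse_only=only_imgs)

srcs = [img.get('src') for img in soup.find_all('img')]
print(srcs)
```

The same parse_only argument works when you pass urlopen(url) instead of an inline string.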
Hope that helps.
Have a look at BeautifulSoup.find_all combined with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = bs.find_all(href=re.compile(r"media\.tumblr"))
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
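Note that filtering on href returns the <link> icon tags shown in the comment above, not the image URLs. To get the img src values instead, the same idea can be applied with a src filter; a sketch against the snippet from the question:

```python
import re
from bs4 import BeautifulSoup

# The snippet from the question plus one of the <link> tags matched above,
# inlined so the example is self-contained
html = '''<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>'''

soup = BeautifulSoup(html, 'html.parser')
# Restrict the match to <img> tags whose src matches the tumblr media pattern;
# the <link> tag is excluded because it has no src attribute
images = soup.find_all('img', src=re.compile(r'media\.tumblr\S*\.jpg'))
urls = [img['src'] for img in images]
print(urls)
```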