python beautifulsoup iframe document html extraction

Question

I am trying to learn a bit of Beautiful Soup and to extract some HTML data from iframes, but I have not been very successful so far.

Parsing the iframe element itself does not seem to be a problem with BS4, but I cannot get the embedded content out of it, whatever I do.

For example, consider the iframe below (this is what I see in Chrome Developer Tools):

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

where <html>...</html> is the content I am interested in extracting.

However, when I use the following BS4 code:

iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, I get the iframes without the <html>...</html> document inside them.

I also tried something along these lines:

iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

... but this does not seem to work either.

So my question is: how do I reliably extract these <html>...</html> document objects from the iframe elements?

Answer 1

Browsers load the iframe content in a separate request. You'll have to do the same:

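import urllib2
from bs4 import BeautifulSoup
for iframe in iframexx:
    # Request the document the iframe points to and parse it on its own
    response = urllib2.urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response)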
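
On Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. The helper names iframe_urls and fetch_iframe_soups, the optional fetch parameter, and the 10 second timeout are illustrative assumptions, not part of the original answer. The sketch also strips stray whitespace from src and resolves relative paths against the page URL, two things the short loop above leaves out.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def iframe_urls(page_html, page_url):
    """Return the absolute URL of every iframe that has a non-empty src."""
    soup = BeautifulSoup(page_html, "html.parser")
    urls = []
    for iframe in soup.find_all("iframe"):
        # src can be missing, or carry stray whitespace as in the question
        src = (iframe.get("src") or "").strip()
        if src:
            # Resolve relative paths against the URL of the containing page
            urls.append(urljoin(page_url, src))
    return urls


def fetch_iframe_soups(page_html, page_url, fetch=None):
    """Fetch each iframe document in its own request and parse it separately.

    `fetch` is a hypothetical hook (url -> markup) so the network call can
    be swapped out; by default it uses urllib.request.urlopen.
    """
    if fetch is None:
        def fetch(url):
            with urlopen(url, timeout=10) as response:
                return response.read()
    return [
        BeautifulSoup(fetch(url), "html.parser")
        for url in iframe_urls(page_html, page_url)
    ]
```

Calling fetch_iframe_soups(page_html, page_url) then gives one BeautifulSoup object per iframe, and each of those can be searched with find_all as usual.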

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
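
A small sketch of why the attempts in the question come back empty: the markup the server sends contains only the iframe tag, and the #document node shown in the developer tools appears only after the browser has made the second request. The page_html string here is a made-up stand-in for the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, as the server would send it (no embedded document)
page_html = """
<html><body>
<iframe frameborder="0" scrolling="NO"
        src="http://www.engineeringmaterials.com/boron/728x90.html"
        width="728" height="90"></iframe>
</body></html>
"""

soup = BeautifulSoup(page_html, "html.parser")
iframe = soup.find("iframe")

# There is no <html> element nested inside the iframe tag in the raw markup
nested = iframe.find_all("html")
print(nested)         # prints []

# All the parser can give you is the address of the embedded document
print(iframe["src"])
```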

14, accepted, 2014-04-12 09:38:33
