python beautifulsoup iframe document html extraction

Question

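I'm trying to learn a bit of Beautiful Soup and to get some HTML data out of some iframes, but so far I haven't had much success.

Parsing the iframe itself doesn't seem to be a problem for BS4, but I can't get at the embedded content, no matter what I do.

For example, consider the iframe below (this is what I see in Chrome developer tools):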

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
#document <html>....</html></iframe>

Here, <html>...</html> is the content I'm interested in extracting.

However, when I use the following BS4 code:

iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    # extract() removes the tag from the tree and returns it
    iFrames.append(iframe.extract())

I get:

<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">

In other words, the iframes I get back do not contain the <html>...</html> document.

I also tried the following:

iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print(iframe.find_all('html'))

...but that doesn't seem to work either.

So I guess my question is: how do I reliably extract these <html>...</html> document objects from the iframe elements?

Answer 1

Browsers load the iframe content in a separate request. You'll have to do the same:

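import urllib2  # Python 2; use urllib.request on Python 3
from bs4 import BeautifulSoup
for iframe in iframexx:
    # Fetch the iframe's own document with a separate request
    response = urllib2.urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response)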

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS, or JavaScript resources for you either.
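
For readers on Python 3, the same idea might be sketched roughly as follows. This is only an illustration, not part of the original answer: the helper names fetch_html and iframe_soups are made up here, the "html.parser" backend is an arbitrary choice, and it assumes each iframe's src can be resolved against the URL of the parent page.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a URL and return the raw response body (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def iframe_soups(page_html, page_url, fetch=fetch_html):
    """Return one BeautifulSoup object per iframe that has a src attribute.

    page_url is used to resolve relative src values; fetch is any callable
    mapping a URL to HTML, so the network call can be swapped out.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    results = []
    for iframe in soup.find_all("iframe", src=True):
        # Strip stray whitespace, then resolve against the parent page URL.
        src = urljoin(page_url, iframe["src"].strip())
        # The iframe document is a separate resource, so fetch and parse it.
        results.append(BeautifulSoup(fetch(src), "html.parser"))
    return results
```

Something like iframe_soups(parent_html, parent_url) would then give you a list of parsed iframe documents that you can search with find_all as usual. Iframes without a src, or whose content is filled in by JavaScript, would still need a different approach.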

Answer 1: score 14, accepted, 2014-04-12 09:38:33
