I am trying to learn a bit of Beautiful Soup and to get some HTML data out of some iframes, but I have not been very successful so far.
Parsing the iframe element itself does not seem to be a problem with BS4, but I cannot get at the embedded content, whatever I do.
For example, consider the iframe below (this is what I see in Chrome developer tools):
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO"
src="http://www.engineeringmaterials.com/boron/728x90.html "width="728" height="90">
#document <html>....</html></iframe>
where <html>...</html> is the content I am interested in extracting.
However, when I use the following BS4 code:
iFrames = []  # quick bs4 example
for iframe in soup("iframe"):
    iFrames.append(iframe.extract())
I get:
<iframe frameborder="0" marginwidth="0" marginheight="0" scrolling="NO" src="http://www.engineeringmaterials.com/boron/728x90.html" width="728" height="90">
In other words, I get the iframes without the document <html>...</html> within them.
I tried something along the lines of:
iFrames = []  # quick bs4 example
iframexx = soup.find_all('iframe')
for iframe in iframexx:
    print iframe.find_all('html')
but this does not seem to work either.
So I guess my question is: how do I reliably extract these document objects <html>...</html> from the iframe elements?
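Browsers load the iframe content in a separate request. You'll have to do the same:
import urllib2
from bs4 import BeautifulSoup
for iframe in iframexx:
    # fetch the document the iframe points to and parse it on its own
    response = urllib2.urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response)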
Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
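For anyone on Python 3, where urllib2 no longer exists, here is a rough sketch of the same idea. The names fetch_url and iframe_soups are made up for illustration and are not part of BeautifulSoup. It assumes the src may be relative to the page URL or carry stray whitespace, so it resolves it with urljoin and strips it first. The fetch function is a parameter so the parsing logic can be tried without hitting the network.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_url(url):
    """Download a URL and return the raw response body as bytes."""
    with urlopen(url) as response:
        return response.read()


def iframe_soups(page_html, base_url, fetch=fetch_url):
    """Return (absolute_url, soup) pairs, one per iframe that has a src.

    Hypothetical helper: each iframe document is fetched in its own
    request, because the parent page's HTML does not contain it.
    """
    soup = BeautifulSoup(page_html, "html.parser")
    results = []
    for iframe in soup.find_all("iframe"):
        # src can be missing, or padded with whitespace as in the question
        src = (iframe.get("src") or "").strip()
        if not src:
            continue
        # resolve relative src values against the URL of the parent page
        absolute = urljoin(base_url, src)
        results.append((absolute, BeautifulSoup(fetch(absolute), "html.parser")))
    return results
```

Usage would look something like iframe_soups(html, page_url), then reading the embedded markup from each returned soup. Content that the iframe page builds with JavaScript will still be missing, for the same reason as above.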