I have searched high and low for a decent explanation of how BeautifulSoup or lxml works. Granted, their documentation is great, but for a Python/programming novice like me, it is difficult to find what I am looking for.
Anyway, as my first project, I am using Python to parse an RSS feed for post links, which I have accomplished with feedparser. My plan is then to scrape each post's images. For the life of me, though, I cannot figure out how to get either BeautifulSoup or lxml to do what I want! I have spent hours reading through the documentation and googling to no avail, so I am here. The following is a line from the Big Picture (the site I am scraping).
<div class="bpBoth"><a name="photo2"></a><img src="http://inapcache.boston.com/universal/site_graphics/blogs/bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" style="height:1393px;width:990px" /><br/><div onclick="this.style.display='none'" class="noimghide" style="margin-top:-1393px;height:1393px;width:990px"></div><div class="bpCaption"><div class="photoNum"><a href="#photo2">2</a></div>In this photo released by China's Xinhua news agency, spectators watch an apartment building on fire in the downtown area of Shanghai on Monday Nov. 15, 2010. (AP Photo/Xinhua) <a href="#photo2">#</a><div class="cf"></div></div></div>
So, according to my understanding of the documentation, I should be able to pass the following:
soup.find("a", { "class" : "bpImage" })
to find all instances with that CSS class. However, it returns nothing. I'm sure I'm overlooking something trivial, so I greatly appreciate your patience.
Thank you very much for your responses!
For future googlers, I'll include my feedparser code:
#! /usr/bin/python
# RSS Feed Parser for the Big Picture Blog
# Import applicable libraries
import feedparser
#Import Feed for Parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")
# Print feed name
print d['feed']['title']
# Determine number of posts and set range maximum
posts = len(d['entries'])
# Collect Post URLs
pointer = 0
while pointer < posts:
    e = d.entries[pointer]
    print e.link
    pointer = pointer + 1
Using lxml, you might do something like this:
import feedparser
import lxml.html as lh
import urllib2
#Import Feed for Parsing
d = feedparser.parse("http://feeds.boston.com/boston/bigpicture/index")
# Print feed name
print d['feed']['title']
# Determine number of posts and set range maximum
posts = len(d['entries'])
# Collect Post URLs
for post in d['entries']:
    link = post['link']
    print('Parsing {0}'.format(link))
    doc = lh.parse(urllib2.urlopen(link))
    imgs = doc.xpath('//img[@class="bpImage"]')
    for img in imgs:
        print(img.attrib['src'])
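One caveat with the XPath above: @class="bpImage" compares the whole attribute string, so it would miss an element written as class="bpImage wide". Below is a minimal Python 3 sketch of a token-based match, run against an inline snippet rather than the live page. The SAMPLE markup, the example.com URLs, and the image_srcs helper are made up for illustration and are not part of the original answer.

```python
import lxml.html as lh

# Hypothetical markup for illustration; not taken from the real page.
SAMPLE = """
<html><body>
<img src="http://example.com/a.jpg" class="bpImage" />
<img src="http://example.com/b.jpg" class="bpImage wide" />
<img src="http://example.com/c.jpg" class="thumb" />
</body></html>
"""

# Matches only when the class attribute is exactly "bpImage".
EXACT = './/img[@class="bpImage"]/@src'
# Matches when "bpImage" is one of several space-separated class names.
TOKEN = ('.//img[contains(concat(" ", normalize-space(@class), " "), '
         '" bpImage ")]/@src')


def image_srcs(doc, query):
    """Return the src values selected by an XPath query as plain strings."""
    return [str(src) for src in doc.xpath(query)]


doc = lh.fromstring(SAMPLE)
exact_matches = image_srcs(doc, EXACT)
token_matches = image_srcs(doc, TOKEN)
print(exact_matches)
print(token_matches)
```

For the line quoted in the question, both queries would select the same image, since its class attribute holds only bpImage.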
The code you have posted looks for all <a> elements with the bpImage class. But your example has the bpImage class on the <img> element, not the <a>. You just need to do:
soup.find("img", { "class" : "bpImage" })
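Since the question asks for all instances, it may help to know that find returns only the first match (or None when nothing matches), while find_all returns every match. Here is a small Python 3 sketch, assuming the bs4 package (BeautifulSoup 4) is installed; in BeautifulSoup 3 the same method is spelled findAll. The HTML string is a trimmed copy of the line from the question.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the line quoted in the question.
HTML = (
    '<div class="bpBoth"><a name="photo2"></a>'
    '<img src="http://inapcache.boston.com/universal/site_graphics/blogs/'
    'bigpicture/shanghaifire_11_22/s02_25947507.jpg" class="bpImage" '
    'style="height:1393px;width:990px" /><br/>'
    '<div class="bpCaption"><a href="#photo2">#</a></div></div>'
)

soup = BeautifulSoup(HTML, "html.parser")

# No <a> element carries the bpImage class, so this list is empty.
anchors = soup.find_all("a", {"class": "bpImage"})

# The class sits on the <img> element, so this finds the photo.
images = soup.find_all("img", {"class": "bpImage"})
srcs = [img["src"] for img in images]

print(len(anchors))
print(srcs)
```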
Using pyparsing to search for tags is fairly intuitive:
from pyparsing import makeHTMLTags, withAttribute
imgTag,notused = makeHTMLTags('img')
# only retrieve <img> tags with class='bpImage'
imgTag.setParseAction(withAttribute(**{'class':'bpImage'}))
# html is assumed to hold the page source as a string
for img in imgTag.searchString(html):
    print img.src