How to scrape a news feed?

Question

I have been going through the Scrapy examples and they make sense, but as soon as I try it on a news feed I don't get anything but titles and don't know how to proceed.

scrapy shell http://feeds.bbci.co.uk/news/rss.xml

All I can get from this is

response.xpath('//title')

Which outputs

[<Selector xpath='//title' data=u'<title xmlns:media="http://search.yahoo.'>]

How can I find the tags inside?

When I try this:

response.xpath('//div')

it returns an empty list. I have tried Inspect Element in Chrome to check the content, but I can't even get to the body to try things out. Thanks

Answer 1

RSS is not an HTML document; it is an XML document. You can find information on RSS at http://www.w3schools.com/xml/xml_rss.asp . An RSS document looks something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">

<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>

</rss>

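So there are no div tags in it, which is why response.xpath('//div') comes back empty. To get the description of each post/news item you can use response.xpath('//description/text()'). Note that this also matches the channel-level description; response.xpath('//item/description/text()') restricts the result to the items.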
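
If you want each item's title, link and description kept together rather than one flat list of descriptions, you can loop over the item elements and query relative to each one. In the Scrapy shell that would be a loop over response.xpath('//item') with calls such as item.xpath('title/text()') inside it. The sketch below shows the same idea on the sample feed above, but using Python's standard-library xml.etree.ElementTree instead of Scrapy so that it runs without a network connection; SAMPLE_RSS and parse_items are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample feed shown above (hypothetical constant name).
SAMPLE_RSS = """<rss version="2.0">
<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, keeping title, link and description together."""
    root = ET.fromstring(rss_text)
    items = []
    # ".//item" finds every <item> at any depth, like "//item" in full XPath.
    for item in root.findall(".//item"):
        items.append({
            # findtext() returns the text of the first matching child, or None.
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["title"], "|", entry["link"], "|", entry["description"])
```

Querying relative to each item keeps the channel-level title and description out of the results, since only children of item are read.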

Scrapy docs can be found here http://doc.scrapy.org/en/latest/intro/tutorial.html

solution1
2 ACCEPTED 2015-01-10 14:42:53
