How do I scrape a news feed?

Question

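I have been working through the Scrapy examples and they make sense, but as soon as I try Scrapy on a news feed I get nothing but the title, and I don't know how to proceed.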

scrapy shell http://feeds.bbci.co.uk/news/rss.xml

The only thing I can get out of it is

response.xpath('//title')

which outputs

[<Selector xpath='//title' data=u'<title xmlns:media="http://search.yahoo.'>]

How do I get at the tags inside it?

When I try this:

response.xpath('//div')

It returns nothing. I have tried inspecting the content with Chrome's Inspect Element, but I can't even get into the body to try anything. Thanks.

Answer 1

RSS is not an HTML document but an XML document. You can find information about RSS at http://www.w3schools.com/xml/xml_rss.asp. An RSS file looks like this:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">

<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>

</rss>

So there are no div tags in it. To get the description of each post or news item, you can use response.xpath('//description/text()').
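
As a minimal sketch of the structure described above, the following parses the sample feed with Python's standard xml.etree.ElementTree rather than Scrapy (an assumption made so that it runs without a crawl). The names RSS_SAMPLE and parse_items are hypothetical. In scrapy shell, the rough equivalent would be to loop over response.xpath('//item') and run relative queries such as 'description/text()' on each result.

```python
import xml.etree.ElementTree as ET

# Sample feed copied from the answer above (XML declaration omitted).
RSS_SAMPLE = """<rss version="2.0">
<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item> with its title, link and description."""
    root = ET.fromstring(xml_text)
    items = []
    # root.iter("item") visits every <item> element at any depth.
    for item in root.iter("item"):
        items.append({
            # findtext looks only at direct children of this <item>,
            # so the channel-level <title>/<description> are not picked up.
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(RSS_SAMPLE):
        print(entry["title"], "-", entry["description"])
```

Note that a document-wide query like //description also matches the channel's own description, so going item by item keeps each title, link and description together.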

The Scrapy documentation can be found at http://doc.scrapy.org/en/latest/intro/tutorial.html

Solution 1
2 | Accepted | 2015-01-10 14:42:53
