
How to scrape a news feed?

I have been going through the Scrapy examples and they make sense, but as soon as I try Scrapy on a news feed I get nothing but titles and don't know how to proceed.

scrapy shell http://feeds.bbci.co.uk/news/rss.xml

All I can get from this is

response.xpath('//title')

which outputs

<Selector xpath='//title' data=u'<title xmlns:media="http://search.yahoo.'>]

How can I find the tags inside?

When I try this:

response.xpath('//div')

it returns nothing. I have tried Chrome's Inspect Element to check the content, but somehow I can't even get to the body to try things out. Thanks

RSS is not an HTML document; it is an XML document. You can find information on RSS at http://www.w3schools.com/xml/xml_rss.asp . RSS documents look something like this:

<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">

<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>

</rss>

So there are no div tags in it. To get the description of each post/news item, you can use response.xpath('//description/text()'). Note that this expression also matches the channel-level description; response.xpath('//item/description/text()') restricts the result to the items.
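To make the structure concrete, here is a minimal sketch that walks the sample feed above item by item. It is an illustration only: it uses Python's standard-library xml.etree.ElementTree instead of Scrapy so that it runs without a network connection, and the names RSS and parse_items are hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

# Sample feed copied from the answer above (XML declaration omitted).
RSS = """<rss version="2.0">
<channel>
  <title>W3Schools Home Page</title>
  <link>http://www.w3schools.com</link>
  <description>Free web building tutorials</description>
  <item>
    <title>RSS Tutorial</title>
    <link>http://www.w3schools.com/rss</link>
    <description>New RSS tutorial on W3Schools</description>
  </item>
  <item>
    <title>XML Tutorial</title>
    <link>http://www.w3schools.com/xml</link>
    <description>New XML tutorial on W3Schools</description>
  </item>
</channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, skipping the channel-level fields."""
    root = ET.fromstring(xml_text)
    items = []
    # iter("item") visits every <item> element anywhere under the root.
    for item in root.iter("item"):
        items.append({
            # findtext looks up a direct child and returns its text (or None).
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items


items = parse_items(RSS)
for entry in items:
    print(entry["title"], "-", entry["description"])
```

In the Scrapy shell, the same idea should carry over by looping over response.xpath('//item') and querying each item with a relative path such as item.xpath('title/text()').extract_first(). Real feeds may also contain namespaced elements (the BBC output above shows an xmlns:media attribute), which this sketch does not handle.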

The Scrapy docs can be found here: http://doc.scrapy.org/en/latest/intro/tutorial.html
