How to scrape a news feed?
I have been going through the Scrapy examples and they make sense, but as soon as I try it on a news feed I get nothing but titles and don't know how to proceed.
scrapy shell http://feeds.bbci.co.uk/news/rss.xml
All I can get from this is:
response.xpath('//title')
which outputs:
<Selector xpath='//title' data=u'<title xmlns:media="http://search.yahoo.'>]
How can I find the tags inside?
When I try this:
response.xpath('//div')
it returns null. I have tried Inspect Element in Chrome to check the content, but somehow I can't even get to the body to try things out. Thanks.
RSS is not an HTML document; it is an XML document. You can find info on RSS at http://www.w3schools.com/xml/xml_rss.asp. RSS documents look something like:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0">
<channel>
<title>W3Schools Home Page</title>
<link>http://www.w3schools.com</link>
<description>Free web building tutorials</description>
<item>
<title>RSS Tutorial</title>
<link>http://www.w3schools.com/rss</link>
<description>New RSS tutorial on W3Schools</description>
</item>
<item>
<title>XML Tutorial</title>
<link>http://www.w3schools.com/xml</link>
<description>New XML tutorial on W3Schools</description>
</item>
</channel>
</rss>
So there are no div tags in it. To get the description of each post/news item you can use response.xpath('//description/text()')
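To see how the fields line up per item, here is a minimal sketch that walks the sample document above with Python's standard-library xml.etree.ElementTree rather than Scrapy, so it can be run without a network connection. The names RSS_SAMPLE and extract_items are made up for this illustration. Note that '//description/text()' also matches the channel-level description; looping over each item avoids that. In the Scrapy shell the analogous approach would be to loop over response.xpath('//item') and apply relative XPaths such as 'title/text()' to each item.

```python
import xml.etree.ElementTree as ET

# Sample feed copied from the answer (XML declaration left out so the
# string can be parsed directly). RSS_SAMPLE is a name made up for this sketch.
RSS_SAMPLE = """<rss version="2.0">
<channel>
<title>W3Schools Home Page</title>
<link>http://www.w3schools.com</link>
<description>Free web building tutorials</description>
<item>
<title>RSS Tutorial</title>
<link>http://www.w3schools.com/rss</link>
<description>New RSS tutorial on W3Schools</description>
</item>
<item>
<title>XML Tutorial</title>
<link>http://www.w3schools.com/xml</link>
<description>New XML tutorial on W3Schools</description>
</item>
</channel>
</rss>"""


def extract_items(rss_text):
    """Return one dict per <item> holding its title, link and description."""
    root = ET.fromstring(rss_text)
    items = []
    # iter("item") visits every <item> element; the channel-level
    # <title> and <description> are not children of an <item>, so they are skipped.
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items


items = extract_items(RSS_SAMPLE)
for entry in items:
    print(entry["title"], "-", entry["description"])
```

Keeping each item's fields together in one record, instead of collecting all titles and all descriptions in separate lists, makes it harder to mismatch a title with the wrong description.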
Scrapy docs can be found here: http://doc.scrapy.org/en/latest/intro/tutorial.html