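I'm using Python and the Scrapy library. The idea is to crawl a URL and save the required fields to a database (news items in this case). Unfortunately, it currently saves only one list item instead of several. It doesn't seem to iterate correctly.
Any help would be much appreciated.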
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose

from scraper_app.items import ListItem


class ListSpider(BaseSpider):
    name = "news_list"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/Default/Section/1"]

    news_items_xpath = '//*[@id="section-news"]/section/ul/li[1]/div'
    item_fields = {
        'title': './/div/h3',
        'link': './/div/h3/a',
        'description': './/div/p/text()',
        'date': './/div/div[2]',
    }

    def parse(self, response):
        selector = HtmlXPathSelector(response)

        # iterate over deals
        for news in selector.select(self.news_items_xpath):
            loader = XPathItemLoader(ListItem(), selector=news)

            # define processors
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()

            # iterate over fields and add xpaths to the loader
            for field, xpath in self.item_fields.iteritems():
                loader.add_xpath(field, xpath)
            yield loader.load_item()
HTML:
<div id="section-news" class="block secondary">
  <section class="inner">
    <ul class="thumbs">
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393107/AntiIsraelism-not-antiSemitism"><img src="http://217.218.67.233/photo/20150114/59b5efd9-3c1c-47b1-a014-4ca0fedadeb6.jpg" alt="uk jews" /><i class="icon-play"></i></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393107/AntiIsraelism-not-antiSemitism">‘Anti-Israelism not anti-Semitism’</a></h3>
            <div class="date">Wed Jan 14, 2015 7:27PM</div>
            <p>A new survey which reveals that nearly half of Britons hold anti-Semitic views.</p>
          </div>
        </div>
      </li>
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393095/Turkey-bans-arms-delivery-reports"><img src="http://217.218.67.233/photo/20150114/2de1eb77-ba2a-49c9-a232-ab4cf82ffc1d.jpg" alt="Syria-militants" /></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393095/Turkey-bans-arms-delivery-reports">Turkey bans arms delivery reports</a></h3>
            <div class="date">Wed Jan 14, 2015 7:22PM</div>
            <p>Turkey bans media reports on alleged arms delivery to militants in Syria.</p>
          </div>
        </div>
      </li>
      <li>
        <div>
          <div class="img">
            <a href="/Detail/2015/01/14/393099/Egypt-Israel-gas-imports-possible"><img src="http://217.218.67.233/photo/20150114/c63935fb-8221-43fc-8103-6f49f013cbfd.jpg" alt="Egypt-Israel" /></a>
          </div>
          <div class="desc">
            <h3 class="title"><a href="/Detail/2015/01/14/393099/Egypt-Israel-gas-imports-possible">Egypt: Israel gas imports possible</a></h3>
            <div class="date">Wed Jan 14, 2015 7:11PM</div>
            <p>Egypt says importing gas from Israel is a possibility.</p>
          </div>
        </div>
      </li>
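The problem is that your XPath is restricted to a single list entry. The li[1] step matches only the first <li> under the <ul>, so the loop in parse() runs once: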
news_items_xpath = '//*[@id="section-news"]/section/ul/li[1]/div'
Remove the [1] so that the expression matches every <li>:
news_items_xpath = '//*[@id="section-news"]/section/ul/li/div'
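To see the effect of the positional predicate in isolation, here is a minimal sketch that runs both expressions against a cut-down version of the markup. It uses the standard library's xml.etree.ElementTree rather than Scrapy, which is an assumption made only so the snippet runs without extra dependencies; ElementTree supports just a subset of XPath, so the paths start with ".//" instead of "//". The <root> wrapper, the shortened hrefs, and the placeholder titles are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page in the question.
# The <root> wrapper and the placeholder links/titles are hypothetical.
HTML = """
<root>
  <div id="section-news" class="block secondary">
    <section class="inner">
      <ul class="thumbs">
        <li><div><h3 class="title"><a href="/a">First</a></h3></div></li>
        <li><div><h3 class="title"><a href="/b">Second</a></h3></div></li>
        <li><div><h3 class="title"><a href="/c">Third</a></h3></div></li>
      </ul>
    </section>
  </div>
</root>
"""

root = ET.fromstring(HTML)

# With li[1], only the first <li> child of the <ul> is selected.
only_first = root.findall(".//*[@id='section-news']/section/ul/li[1]/div")

# Without the predicate, every <li> child is selected.
all_items = root.findall(".//*[@id='section-news']/section/ul/li/div")

titles_first = [div.find("./h3/a").text for div in only_first]
titles_all = [div.find("./h3/a").text for div in all_items]

print(len(only_first), titles_first)
print(len(all_items), titles_all)
```

The same reasoning should carry over to the spider: once news_items_xpath matches one <div> per <li>, the for loop in parse() yields one item per news entry.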