
Saving Web Crawling Results (Scrapy)

I've written a spider that seems to be functioning properly, but I'm not sure how to save the data it is collecting.

The spider starts out at TheScienceForum, grabs the main forum pages, and makes an item for each. It then proceeds to go through all of the individual forum pages (passing the items along with it), adding the title of each thread to the matching forum item. The code is as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        # build one item per top-level forum, keyed by the forum's title
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            items.append(item)

        # follow each forum link, passing the whole item list along in meta
        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        itemDict = {}
        itemDict['items'] = items
        for forum in forums:
            yield Request(url=forum, meta=itemDict, callback=self.addThreadNames)

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()
        for item in items:
            # append this page's thread titles to the matching forum item
            if currentForum == item['name']:
                item['thread'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(items)

        # follow the "next page" link within the current forum
        itemDict = {}
        itemDict['items'] = items
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:
            yield Request(url=threadPageNav, meta=itemDict, callback=self.addThreadNames)

It seems that because I never simply return the object (only new requests are yielded), the data never persists anywhere. I've tried using the following JSON pipeline:

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        # called once for every item the spider yields
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
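
For a custom pipeline to run at all, it has to be enabled in the project settings. A minimal sketch of that wiring, assuming the class above lives in individualProject/pipelines.py (in older Scrapy releases ITEM_PIPELINES is a plain list of class paths rather than a dict):

# settings.py -- assumed module path; the number controls pipeline ordering
ITEM_PIPELINES = {
    'individualProject.pipelines.JsonWriterPipeline': 300,
}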

I've also tried running the spider with the following:

scrapy crawl TheScienceForum.com -o items.json -t json

But so far nothing is working. Where might I be going wrong?

Any ideas or constructive criticism are warmly welcomed.

You need to yield item in at least one of your callbacks.
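
The spider in the question only ever yields Request objects, so no item ever reaches a pipeline or the feed exporter. A minimal sketch of how addThreadNames could emit items once a forum's last page is reached (assuming ProjectItem declares a thread field, as the original code implies; note that extract() returns a list, so comparing it directly to item['name'] never matches):

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()
        nextPages = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for item in items:
            # extract() returns a list, so compare its first element, not the list itself
            if currentForum and currentForum[0] == item['name']:
                if 'thread' not in item:
                    item['thread'] = []
                item['thread'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
                if not nextPages:
                    # no "next" link: this forum is finished, so hand the item over
                    yield item
        if nextPages:
            yield Request(url=nextPages[0], meta={'items': items}, callback=self.addThreadNames)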

@bornytm -

self.addThreadNames 

is the function to which you are passing the URLs or intermediate results. If you want to save the data to a csv or json file, you can do the following:

yield "result"  ("result" can be replaced with your variable to which you are storing data. If you have multiple value , use yield in for loop. )

After that:

scrapy crawl TheScienceForum.com -o output.csv -t csv

This will help you.
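
Once items are actually being yielded, the feed exporter writes one record per item. A hypothetical record from the resulting output (the forum name and thread titles here are made up):

{"name": "Physics", "thread": ["Why is the sky blue?", "Question about entanglement"]}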
