
Saving Web Crawling Results (Scrapy)

I've written a spider that seems to be functioning properly, but I'm not sure how to save the data it is collecting.

The spider starts out at TheScienceForum, grabs the main forum pages, and makes an item for each. It then proceeds through all of the individual forum pages (passing the items along with it), adding the title of each thread to the matching forum item. The code is as follows:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

from individualProject.items import ProjectItem

class TheScienceForum(BaseSpider):
    name = "TheScienceForum.com"
    allowed_domains = ["www.thescienceforum.com"]
    start_urls = ["http://www.thescienceforum.com"]

    def parse(self, response):
        Sel = HtmlXPathSelector(response)
        forumNames = Sel.select('//h2[@class="forumtitle"]/a/text()').extract()
        items = []
        for forumName in forumNames:
            item = ProjectItem()
            item['name'] = forumName
            items.append(item)


        forums = Sel.select('//h2[@class="forumtitle"]/a/@href').extract()
        itemDict = {}
        itemDict['items'] = items
        for forum in forums:
            yield Request(url=forum,meta=itemDict,callback=self.addThreadNames)  

    def addThreadNames(self, response):
        items = response.meta['items']
        Sel = HtmlXPathSelector(response)
        currentForum = Sel.select('//h1/span[@class="forumtitle"]/text()').extract()
        for item in items:
            if currentForum==item['name']:
                item['thread'] += Sel.select('//h3[@class="threadtitle"]/a/text()').extract()
        self.log(items)


        itemDict = {}
        itemDict['items'] = items
        threadPageNavs = Sel.select('//span[@class="prev_next"]/a[@rel="next"]/@href').extract()
        for threadPageNav in threadPageNavs:  
            yield Request(url=threadPageNav,meta=itemDict,callback=self.addThreadNames)

It seems that because I never simply return an item (only new requests are yielded), the data never persists anywhere. I've tried using the following JSON pipeline:

import json

class JsonWriterPipeline(object):

    def __init__(self):
        self.file = open('items.jl', 'wb')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

And also running the spider with the following:

scrapy crawl TheScienceForum.com -o items.json -t json

but so far nothing is working. Where might I be going wrong?

Any ideas or positive criticism are warmly welcomed.

You need to yield item in at least one of your callbacks.
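A dependency-free sketch of what that means (the Request class below is a stand-in for scrapy.http.Request, just so the example runs anywhere): a Scrapy callback is a generator, so the same method can yield follow-up Requests for further pages and finished items for the pipeline. The spider in the question only ever yields Requests, so no item ever reaches a pipeline or feed exporter.

```python
class Request:
    """Minimal stand-in for scrapy.http.Request."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def add_thread_names(items, next_page_urls):
    """Mimics addThreadNames: while a 'next' link exists, follow it;
    on the last page, yield the accumulated items themselves."""
    if next_page_urls:
        for url in next_page_urls:
            yield Request(url, callback=add_thread_names)
    else:
        for item in items:
            yield item  # <-- the missing step in the original spider

# On the last page (no 'next' link) the items finally come out:
final = list(add_thread_names([{'name': 'Physics'}], []))
# On intermediate pages, only Requests come out:
middle = list(add_thread_names([{'name': 'Physics'}],
                               ['http://www.thescienceforum.com/page2']))
```

In the real spider the same branch would go at the end of addThreadNames: if the prev_next selector finds no "next" link, yield each item instead of another Request.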

@bornytm -

self.addThreadNames

is the callback to which you are passing the URLs and intermediate results. If you want to save them to a CSV or JSON file, you can do it as follows:

yield item

(replace item with the variable holding your scraped data; if you have multiple values, use yield inside a for loop.)

After that:

scrapy crawl TheScienceForum.com -o output.csv -t csv

This will help you
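For the pipeline route instead, the snippet in the question is missing import json, never closes the file, and must be registered in the project settings before Scrapy will call it. A corrected sketch (file name items.jl kept from the question):

```python
import json

class JsonWriterPipeline(object):

    def __init__(self):
        # Text mode rather than 'wb': json.dumps returns a str.
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines format).
        self.file.write(json.dumps(dict(item)) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

The pipeline only runs if it is enabled in settings.py, e.g. ITEM_PIPELINES = {'individualProject.pipelines.JsonWriterPipeline': 300} (recent Scrapy versions take a dict of path-to-priority; very old versions took a plain list). The module path here is an assumption based on the individualProject import in the question. And it still requires the spider to yield items, as noted above.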
