
More than one process_item method in pipeline file in python scrapy

I am working with Scrapy. I have created two spider files, each with a different URL, in a single Scrapy project.

Both spiders scrape perfectly when run individually. The problem is that each URL has different items to fetch, so I declared all of the items in the items.py file. After scraping, I store the data in a CSV file that is created dynamically and named after the spider.

For example, when I run spider1 I need one process_item method, and because the items differ between the two spiders, when I run the second spider I have to write another process_item and comment out the first one. Is there any way in Scrapy to use two process_item methods? Below is my pipeline.py code.

pipeline.py

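from w3c_browser.items import WCBrowserItem
import csv
from csv import DictWriter
from cStringIO import StringIO
from datetime import datetime
# Imports needed by the code below (old Scrapy API)
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher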
class W3CBrowserPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
        self.brandCategoryCsv = csv.writer(open('wcbbrowser.csv', 'wb'))

    def spider_opened(self, spider):
        spider.started_on = datetime.now()
        if spider.name == 'browser_statistics':
            log.msg("opened spider  %s at time %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))
            self.brandCategoryCsv = csv.writer(open("csv/%s-%s.csv"% (spider.name,datetime.now().strftime('%d%m%y')), "wb"),
                       delimiter=',', quoting=csv.QUOTE_MINIMAL)
        elif spider.name == 'browser_os':
            log.msg("opened spider  %s at time %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))
            self.brandCategoryCsv = csv.writer(open("csv/%s-%s.csv"% (spider.name,datetime.now().strftime('%d%m%y')), "wb"),
                       delimiter=',', quoting=csv.QUOTE_MINIMAL)

    def process_item(self, item, spider):
        self.brandCategoryCsv.writerow([item['year'],
                                        item['internet_explorer'],
                                        item['firefox'],
                                        item['chrome'],
                                        item['safari'],
                                        item['opera'],
        ])
        return item
# For Browser Os
#    def process_item(self, item, spider):
#        self.brandCategoryCsv.writerow([item['year'],
#                                        item['vista'],
#                                        item['nt'],
#                                        item['winxp'],
#                                        item['linux'],
#                                        item['mac'],
#                                        item['mobile'],
#                                        
#
#        ])
#        return item

    def spider_closed(self, spider):
        log.msg("closed spider %s at %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))
        work_time = datetime.now() - spider.started_on
        print str(work_time),"Total Time taken by the spider to run>>>>>>>>>>>"

As you can see in the code above, when I run the spider named browser_statistics it creates a CSV file named in the format browser_statistics-date and writes the item data to that file.

But when I run the second spider, browser_os, the process_item method does not work, because the two spiders fetch different items.

Can anyone please let me know: is there any way to run more than one spider through the same process_item when their items differ?

The best option is to use an if/else on the spider name inside process_item, like this:

def process_item(self, item, spider):
    if 'spider1' in spider.name:
        # TODO: write CSV row for spider1
        pass
    else:
        # Otherwise it should be spider2
        # TODO: write CSV row for spider2
        pass
    return item
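
To make the branching above concrete, here is a minimal sketch of the same idea using the spider names and item fields from the question. Instead of a chain of if/else branches it looks up the column list by spider.name, which does the same job. The names FIELDS_BY_SPIDER and row_for, and the out argument, are assumptions made up for illustration. The sketch writes to an in-memory stream so it runs on its own. A real pipeline would open its per-spider CSV file instead.

```python
import csv
import io

# Column order per spider name. The field names are taken from the question;
# the dictionary itself is a hypothetical helper, not part of Scrapy.
FIELDS_BY_SPIDER = {
    "browser_statistics": ["year", "internet_explorer", "firefox",
                           "chrome", "safari", "opera"],
    "browser_os": ["year", "vista", "nt", "winxp", "linux", "mac", "mobile"],
}


def row_for(item, spider_name):
    """Return the item's values in the column order for the given spider."""
    fields = FIELDS_BY_SPIDER[spider_name]
    return [item[name] for name in fields]


class W3CBrowserPipeline:
    """One pipeline with a single process_item that serves both spiders."""

    def __init__(self, out=None):
        # `out` is an assumed hook: any writable text stream. It defaults to
        # an in-memory buffer so the sketch runs without touching the disk.
        self.out = out if out is not None else io.StringIO()
        self.writer = csv.writer(self.out, delimiter=",",
                                 quoting=csv.QUOTE_MINIMAL)

    def process_item(self, item, spider):
        # Pick the columns by spider name, then write one CSV row.
        self.writer.writerow(row_for(item, spider.name))
        return item
```

With this layout, supporting a third spider would only need one more entry in FIELDS_BY_SPIDER. A spider name missing from the dictionary raises KeyError, which may be preferable to silently writing the wrong columns.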
