
How to create a CSV file dynamically with the name of the spider in Scrapy (Python)

Hi, I am working with Scrapy to fetch some HTML pages.

I have written my spider and fetched the required data from the pages in spider.py. In my pipeline.py file I want to write all the data to a CSV file created dynamically with the name of the spider. Below is my pipeline.py code.

pipeline.py:

from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from datetime import datetime
import csv


class examplepipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        log.msg("opened spider %s at time %s" % (spider.name, datetime.now().strftime('%H-%M-%S')))
        self.exampleCsv = csv.writer(open("%s(%s).csv" % (spider.name, datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
                   delimiter=',', quoting=csv.QUOTE_MINIMAL)
        self.exampleCsv.writerow(['Listing Name', 'Address', 'Pincode', 'Phone', 'Website'])

    def process_item(self, item, spider):
        log.msg("Processing item " + item['title'], level=log.DEBUG)
        self.exampleCsv.writerow([item['listing_name'].encode('utf-8'),
                                    item['address_1'].encode('utf-8'),
                                    [i.encode('utf-8') for i in item['pincode']],
                                    item['phone'].encode('utf-8'),
                                    [i.encode('utf-8') for i in item['web_site']]
                                    ])
        return item 


    def spider_closed(self, spider):
        log.msg("closed spider %s at %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))

Result:

--- <exception caught here> ---
  File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
    return receiver(*arguments, **named)
  File "/home/local/user/example/example/pipelines.py", line 19, in spider_opened
    self.examplecsv = csv.writer(open("%s(%s).csv"% (spider.name,datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
exceptions.IOError: [Errno 2] No such file or directory: 'example(27/07/2012,10-30-40).csv'

Here the spider name is actually example.

I don't understand what's wrong with the above code. It should create the CSV file dynamically with the spider name, but instead it shows the error above. Can anyone please tell me what is happening here?

The problem is the forward slash (the directory separator) in your filename. It is not allowed in a file name. Try using some other character in the date format.

More info here: http://www.linuxquestions.org/questions/linux-software-2/forward-slash-in-filenames-665010/
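The failure mode is easy to reproduce outside Scrapy: open() treats every "/" in the name as a path component, so it looks for a directory named "example(27" that does not exist. A minimal sketch (run from a directory with no such subdirectory):

```python
# "/" in a file name is interpreted as a directory separator, so open()
# tries to descend into non-existent directories and fails with errno 2.
try:
    open("example(27/07/2012,10-30-40).csv", "w")
except IOError as e:  # FileNotFoundError on Python 3
    print(e)  # [Errno 2] No such file or directory: 'example(27/07/2012,10-30-40).csv'
```

This is exactly the traceback the question shows, minus the Scrapy machinery around it.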

This link is helpful for getting the format you want: How to print date in a regular format in Python?

>>> import datetime
>>> datetime.date.today()
datetime.date(2012, 7, 27)
>>> str(datetime.date.today())
'2012-07-27'

Use this in your code

open("%s(%s).csv" % (spider.name, datetime.now().strftime("%d-%m-%Y:%H-%M-%S")), "wb")

As Kamal pointed out, the immediate issue is the presence of forward slashes in the file name you create. Kamal's solution works, but I would not fix this by using the method Kamal suggested but with:

open("%s(%s).csv" % (spider.name, datetime.now().replace(microsecond=0).isoformat()), "wb")

The main thing here is the use of .isoformat() to put it in the ISO 8601 format:

YYYY-MM-DDTHH:MM:SS.mmmmmm

which has the advantage of being trivially sortable in increasing chronological order. The .replace(microsecond=0) call removes the microsecond information, in which case the trailing .mmmmmm will be absent from the output of .isoformat(). You can drop the call to .replace() if you want to keep the microsecond information. When I drop the microseconds, I write the rest of my application to prevent two invocations from creating the same file.
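A quick illustration with a fixed timestamp, so the output is reproducible:

```python
from datetime import datetime

# A fixed timestamp instead of datetime.now(), for a reproducible example.
ts = datetime(2012, 7, 27, 10, 30, 40, 123456)

print(ts.replace(microsecond=0).isoformat())  # 2012-07-27T10:30:40
print(ts.isoformat())                         # 2012-07-27T10:30:40.123456
```

Note that the ISO format contains ":" characters, which are fine in file names on Linux but not on Windows.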

Also, you could drop your custom __init__, rename spider_opened to open_spider, and spider_closed to close_spider. Scrapy will automatically call open_spider when a spider is opened and close_spider when it is closed. You do not have to hook onto the signals. The documentation has mentioned these methods as far back as Scrapy 0.7.
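Putting both suggestions together, a minimal sketch of the pipeline might look like this. It uses the open_spider/close_spider hooks instead of manual signal wiring; the field names are taken from the question, so adapt them to your own item definition, and note the .encode('utf-8') calls from the original are dropped (needed on Python 2, not on Python 3):

```python
import csv
from datetime import datetime


class ExamplePipeline(object):
    """Writes items to '<spider name>(<ISO timestamp>).csv'."""

    def open_spider(self, spider):
        # Scrapy calls this automatically when the spider is opened.
        timestamp = datetime.now().replace(microsecond=0).isoformat()
        self.csv_file = open("%s(%s).csv" % (spider.name, timestamp), "w")
        self.writer = csv.writer(self.csv_file, delimiter=',',
                                 quoting=csv.QUOTE_MINIMAL)
        self.writer.writerow(['Listing Name', 'Address', 'Pincode',
                              'Phone', 'Website'])

    def process_item(self, item, spider):
        self.writer.writerow([item['listing_name'],
                              item['address_1'],
                              item['pincode'],
                              item['phone'],
                              item['web_site']])
        return item

    def close_spider(self, spider):
        # Scrapy calls this automatically when the spider is closed.
        self.csv_file.close()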
