Hi, I am working with Scrapy to fetch some HTML pages. I have written my spider and extracted the required data from the pages in my spider.py file. In my pipelines.py file I want to write all of that data into a CSV file whose name is created dynamically from the spider's name. Below is my pipelines.py code.
pipelines.py:
import csv
from datetime import datetime

from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher


class examplepipeline(object):

    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)

    def spider_opened(self, spider):
        log.msg("Opened spider %s at time %s" % (spider.name, datetime.now().strftime('%H-%M-%S')))
        self.exampleCsv = csv.writer(
            open("%s(%s).csv" % (spider.name, datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
            delimiter=',', quoting=csv.QUOTE_MINIMAL)
        self.exampleCsv.writerow(['Listing Name', 'Address', 'Pincode', 'Phone', 'Website'])

    def process_item(self, item, spider):
        log.msg("Processing item " + item['title'], level=log.DEBUG)
        self.exampleCsv.writerow([
            item['listing_name'].encode('utf-8'),
            item['address_1'].encode('utf-8'),
            [i.encode('utf-8') for i in item['pincode']],
            item['phone'].encode('utf-8'),
            [i.encode('utf-8') for i in item['web_site']],
        ])
        return item

    def spider_closed(self, spider):
        log.msg("Closed spider %s at %s" % (spider.name, datetime.now().strftime('%H-%M-%S')))
Result:
--- <exception caught here> ---
File "/usr/lib64/python2.7/site-packages/twisted/internet/defer.py", line 133, in maybeDeferred
result = f(*args, **kw)
File "/usr/lib/python2.7/site-packages/Scrapy-0.14.3-py2.7.egg/scrapy/xlib/pydispatch/robustapply.py", line 47, in robustApply
return receiver(*arguments, **named)
File "/home/local/user/example/example/pipelines.py", line 19, in spider_opened
self.examplecsv = csv.writer(open("%s(%s).csv"% (spider.name,datetime.now().strftime("%d/%m/%Y,%H-%M-%S")), "wb"),
exceptions.IOError: [Errno 2] No such file or directory: 'example(27/07/2012,10-30-40).csv'
Here the spider name is actually example.
I don't understand what is wrong with the above code. It should create the CSV file dynamically with the spider name, but instead it raises the error shown above. Can anyone please tell me what is happening here?
The problem is the forward slash (the directory separator) in your filename: it is not allowed in file names, so open() interprets everything before each slash as a (non-existent) directory. Try using some other character in the date format.
More info here: http://www.linuxquestions.org/questions/linux-software-2/forward-slash-in-filenames-665010/
This link is helpful for getting the format you want: How to print date in a regular format in Python?
>>> import datetime
>>> datetime.date.today()
datetime.date(2012, 7, 27)
>>> str(datetime.date.today())
'2012-07-27'
Use this in your code
open("%s(%s).csv" % (spider.name, datetime.now().strftime("%d-%m-%Y:%H-%M-%S")), "wb")
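To see why the original format fails and a dash-based one works, here is a quick sketch using a fixed timestamp and the spider name example from the question, purely for illustration:

```python
from datetime import datetime

# A fixed timestamp so the output is reproducible.
ts = datetime(2012, 7, 27, 10, 30, 40)

# The original format contains '/', the directory separator on Linux,
# so open() looks for a directory named 'example(27' and fails.
bad_name = "%s(%s).csv" % ("example", ts.strftime("%d/%m/%Y,%H-%M-%S"))
print(bad_name)   # example(27/07/2012,10-30-40).csv

# Replacing '/' with '-' yields a valid file name.
good_name = "%s(%s).csv" % ("example", ts.strftime("%d-%m-%Y,%H-%M-%S"))
print(good_name)  # example(27-07-2012,10-30-40).csv
```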
As Kamal pointed out, the immediate issue is the presence of forward slashes in the file name you create. Kamal's solution works, but I would fix it differently:
open("%s(%s).csv" % (spider.name, datetime.now().replace(microsecond=0).isoformat()), "wb")
The main thing here is the use of .isoformat() to put the timestamp in the ISO 8601 format:
YYYY-MM-DDTHH:MM:SS.mmmmmm
which has the advantage of being trivially sortable in increasing chronological order. The .replace(microsecond=0) call removes the microsecond information, in which case the trailing .mmmmmm will be absent from the output of .isoformat(). You can drop the call to .replace() if you want to keep the microseconds. When I drop the microseconds, I write the rest of my application to prevent two invocations from creating the same file.
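For instance, with a fixed timestamp:

```python
from datetime import datetime

ts = datetime(2012, 7, 27, 10, 30, 40, 123456)

# Full ISO 8601 form, microseconds included.
print(ts.isoformat())                         # 2012-07-27T10:30:40.123456

# With microseconds zeroed out, isoformat() drops the fractional part.
print(ts.replace(microsecond=0).isoformat())  # 2012-07-27T10:30:40
```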
Also, you could drop your custom __init__ and rename spider_opened to open_spider and spider_closed to close_spider. Scrapy automatically calls open_spider when a spider is opened and close_spider when it is closed, so you do not have to hook onto the signals at all. The documentation mentions these methods as far back as Scrapy 0.7.
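A minimal sketch of that shape (written in modern Python 3 style; the file-naming scheme and field names follow the question, but treat this as an illustration rather than the exact original code):

```python
import csv
from datetime import datetime


class ExamplePipeline(object):
    """CSV-exporting pipeline: Scrapy calls open_spider and
    close_spider automatically, so no signal wiring is needed."""

    def open_spider(self, spider):
        # isoformat() contains no '/', so the name is always valid on Linux.
        filename = "%s(%s).csv" % (
            spider.name, datetime.now().replace(microsecond=0).isoformat())
        self.file = open(filename, "w", newline="")
        self.writer = csv.writer(self.file, delimiter=",",
                                 quoting=csv.QUOTE_MINIMAL)
        self.writer.writerow(
            ["Listing Name", "Address", "Pincode", "Phone", "Website"])

    def process_item(self, item, spider):
        self.writer.writerow([item["listing_name"], item["address_1"],
                              item["pincode"], item["phone"],
                              item["web_site"]])
        return item

    def close_spider(self, spider):
        self.file.close()
```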