
Scrapy csv file has uniform empty rows?

Here is the spider:

import scrapy
from danmurphys.items import DanmurphysItem

class MySpider(scrapy.Spider):
    name = 'danmurphys'
    allowed_domains = ['danmurphys.com.au']
    start_urls = ['https://www.danmurphys.com.au/dm/navigation/navigation_results_gallery.jsp?params=fh_location%3D%2F%2Fcatalog01%2Fen_AU%2Fcategories%3C%7Bcatalog01_2534374302084767_2534374302027742%7D%26fh_view_size%3D120%26fh_sort%3D-sales_value_30_days%26fh_modification%3D&resetnav=false&storeExclusivePage=false']


    def parse(self, response):        
        urls = response.xpath('//h2/a/@href').extract()
        for url in urls:            
            request = scrapy.Request(url , callback=self.parse_page)      
            yield request

    def parse_page(self , response):
        item = DanmurphysItem()
        item['brand'] = response.xpath('//span[@itemprop="brand"]/text()').extract_first().strip()
        item['name'] = response.xpath('//span[@itemprop="name"]/text()').extract_first().strip()
        item['url'] = response.url     
        return item

Here is the items file:

import scrapy
class DanmurphysItem(scrapy.Item):  
    brand = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()

When I run the spider with this command:

scrapy crawl danmurphys -o output.csv

The output looks like this: (screenshot: the exported CSV shows a blank row after every data row)

To fix this in Scrapy 1.3, you can patch it by adding newline='' as a parameter to io.TextIOWrapper in the __init__ method of the CsvItemExporter class in scrapy.exporters.
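A minimal sketch of that patch, assuming Scrapy 1.3's exporter code (the surrounding lines may differ slightly between versions):

# In scrapy/exporters.py, inside CsvItemExporter.__init__ (Python 3 branch):
self.stream = io.TextIOWrapper(
    file,
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
    newline='',  # added: suppresses the extra \r that creates blank rows on Windows
) if six.PY3 else file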

The output in the question is the typical symptom of a CSV file handle opened in "w" mode on Windows (probably done for Python 3 compatibility) while omitting the newline argument.

While this has no effect on Linux/Unix-based systems, on Windows two carriage-return characters are emitted, inserting a spurious blank line after each data row.

with open("output.csv","w") as f:
     cr = csv.writer(f)

The correct way to do it (Python 3):

with open("output.csv","w",newline='') as f:  # python 3
     cr = csv.writer(f)

(In Python 2, setting "wb" as the open mode fixes it.)
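That is, the Python 2 equivalent of the snippet above would be:

with open("output.csv","wb") as f:  # python 2: binary mode avoids the extra \r
    cr = csv.writer(f)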

If the file is produced by a program that you cannot or do not want to modify, you can post-process it as follows:

with open("output.csv","rb") as f:
   with open("output_fix.csv","w") as f2:
       f2.write(f.read().decode().replace("\r","")) # python 3
       f2.write(f.read().replace("\r","")) # python 2

Scrapy 1.5.0 && Python 3.6.5 :: Anaconda, Inc.

I managed to solve this problem with the following steps:


Folder structure

C:.
|   scrapy.cfg
|
\---my_scraper
    |   exporters.py
    |   items.py
    |   middlewares.py
    |   pipelines.py
    |   settings.py
    |   __init__.py
    |
    +---spiders
    |   |   my_spider.py
    |   |   __init__.py
    |

exporters.py

# -*- coding: utf-8 -*-
import csv
import io
import os
import six

from scrapy.conf import settings
from scrapy.exporters import CsvItemExporter

from scrapy.extensions.feedexport import IFeedStorage
from w3lib.url import file_uri_to_path
from zope.interface import implementer

@implementer(IFeedStorage)
class FixedFileFeedStorage(object):
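    # Minimal IFeedStorage implementation: it reopens the target file in
    # binary append mode ('ab') so the exporter below can rewrap it with
    # newline='' and control newline handling itself.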

    def __init__(self, uri):
        self.path = file_uri_to_path(uri)

    def open(self, spider):
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return open(self.path, 'ab')

    def store(self, file):
        file.close()



class MyCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):

        # Custom delimiter
        delimiter = settings.get('CSV_DELIMITER', ';')
        kwargs['delimiter'] = delimiter

        super(MyCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

        self._configure(kwargs, dont_fail=True)
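        # Close the stream Scrapy opened with default newline handling, reopen
        # the underlying file in binary append mode, and rewrap it with
        # newline="" so Windows does not emit \r\r\n line endings.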
        self.stream.close()
        storage = FixedFileFeedStorage(file.name)
        file = storage.open(file.name)
        self.stream = io.TextIOWrapper(
            file,
            line_buffering=False,
            write_through=True,
            encoding=self.encoding,
            newline="",
        ) if six.PY3 else file
        self.csv_writer = csv.writer(self.stream, **kwargs)

settings.py

# ...

FEED_EXPORT_ENCODING = 'utf-8'

FEED_EXPORTERS = {
    'csv': 'my_scraper.exporters.MyCsvItemExporter',
}

CSV_DELIMITER = ';'
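With this in place, the FEED_EXPORTERS entry replaces the built-in CSV exporter, so the usual export command should now produce a file without blank rows (assuming the spider defined in spiders/my_spider.py is named my_spider):

scrapy crawl my_spider -o output.csv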

I hope this helps.

Special thanks to everyone (Jean-François).

The problem was that I had another Scrapy version (1.1.0) installed in conda for Python 3.5. Once I added Python 2.7 to the system path, the original Scrapy 1.1.2 went back to working by default, and everything worked fine.

I solved it via the pipelines.py file:

Not ideal, I suspect, but I found a workaround for this problem. In the pipelines.py file I added some more code that basically reads the CSV file (with the empty rows) into a list, removes the empty rows, and then writes the cleaned list to a new file.

The code I added is:

import csv

with open('%s_items.csv' % spider.name, 'r') as f:
    reader = csv.reader(f)
    original_list = list(reader)
    cleaned_list = list(filter(None, original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
    wr = csv.writer(output_file, dialect='excel')
    for data in cleaned_list:
        wr.writerow(data)

So, for the details of the whole pipelines.py file, see: Scrapy python csv output has blank lines between each row
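For reference, here is a minimal sketch of how such a cleanup step could be hooked into a pipeline's close_spider method (the class name and file-naming scheme are illustrative, not the exact code from that answer):

import csv

class CsvCleanupPipeline(object):
    def close_spider(self, spider):
        # Read the exported CSV, drop the empty rows, and write a cleaned copy.
        with open('%s_items.csv' % spider.name, 'r') as f:
            cleaned_list = list(filter(None, csv.reader(f)))
        with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
            csv.writer(output_file, dialect='excel').writerows(cleaned_list)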
