Scrapy python csv output has blank lines between each row

I am getting unwanted blank lines between each row of Scrapy output in the resulting CSV file.

I have moved from Python 2 to Python 3, and I use Windows 10. I am therefore in the process of adapting my Scrapy projects for Python 3.

My current (and for now, sole) problem is that when I write the Scrapy output to a CSV file I get a blank line between each row. This has been highlighted in several posts here (it is to do with Windows), but I am unable to get a solution to work.
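To make the symptom concrete, here is a minimal sketch of how a doubled line ending can arise. It assumes the cause is newline translation in a text layer sitting above the file: csv.writer ends each row with '\r\n', and if that layer then turns '\n' into '\r\n' (the Windows default, simulated here with newline='\r\n'), the bytes on disk end in '\r\r\n', which many readers show as a blank row. The helper name csv_bytes is made up for this illustration.

```python
import csv
import io


def csv_bytes(rows, newline):
    """Write rows with csv.writer through a text layer and return the raw bytes."""
    raw = io.BytesIO()
    # newline controls how '\n' is translated when text is written out.
    text = io.TextIOWrapper(raw, encoding="utf-8", newline=newline)
    csv.writer(text).writerows(rows)
    text.flush()
    return raw.getvalue()


rows = [["plotid", "plotprice"], ["1", "100"]]

# Simulates the Windows default: every '\n' becomes '\r\n' on the way out.
windows_like = csv_bytes(rows, "\r\n")

# newline='' disables translation, so the writer's '\r\n' is left alone.
untranslated = csv_bytes(rows, "")

print(windows_like)   # b'plotid,plotprice\r\r\n1,100\r\r\n'
print(untranslated)   # b'plotid,plotprice\r\n1,100\r\n'
```

The second call shows why newline='' is the usual remedy whenever the csv module writes to a text stream.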

As it happens, I have also added some code to the pipelines.py file to ensure the CSV output is in a given column order and not some random order. Hence, I can use the normal scrapy crawl charleschurch to run this code rather than scrapy crawl charleschurch -o charleschurch2017xxxx.csv

Does anyone know how to skip / omit this blank line in the CSV output?

My pipelines.py code is below (I perhaps don't need the import csv line, but I suspect I may need it for the final answer):

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item

I added this line to the settings.py file (I am not sure of the relevance of the 300):

ITEM_PIPELINES = {'CharlesChurch.pipelines.CSVPipeline': 300 }
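On the 300: Scrapy runs the enabled pipelines in ascending order of these integers, and its documentation suggests keeping them in the 0-1000 range. The sketch below only illustrates that ordering with plain Python. The value 100 given to CharleschurchPipeline is an arbitrary assumption for the example, not something taken from the project's settings.

```python
# Lower numbers run first; the exact values only matter relative to each other.
ITEM_PIPELINES = {
    "CharlesChurch.pipelines.CSVPipeline": 300,
    "CharlesChurch.pipelines.CharleschurchPipeline": 100,  # illustrative value
}

# Sorting by the integer gives the order in which items would pass through.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

With a single pipeline enabled, as in the settings line above, the number has no practical effect.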

My Scrapy code is below:

import scrapy
from urllib.parse import urljoin

from CharlesChurch.items import CharleschurchItem

class charleschurchSpider(scrapy.Spider):
    name = "charleschurch"
    allowed_domains = ["charleschurch.com"]    
    start_urls = ["https://www.charleschurch.com/county-durham_willington/the-ridings-1111"]


    def parse(self, response):

        for sel in response.xpath('//*[@id="aspnetForm"]/div[4]'):
           item = CharleschurchItem()
           item['name'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/span[1]/b/text()').extract()
           item['address'] = sel.xpath('//*[@id="XplodePage_ctl12_dsDetailsSnippet_pDetailsContainer"]/div/*[@itemprop="postalCode"]/text()').extract()
           plotnames = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/text()').extract()
           plotnames = [plotname.strip() for plotname in plotnames]
           plotids = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__name"]/a/@href').extract()
           plotids = [plotid.strip() for plotid in plotids]
           plotprices = sel.xpath('//div[@class="housetype js-filter-housetype"]/div[@class="housetype__col-2"]/div[@class="housetype__plots"]/div[not(contains(@data-status,"Sold"))]/div[@class="plot__price"]/text()').extract()
           plotprices = [plotprice.strip() for plotprice in plotprices]
           result = zip(plotnames, plotids, plotprices)
           for plotname, plotid, plotprice in result:
               item['plotname'] = plotname
               item['plotid'] = plotid
               item['plotprice'] = plotprice
               yield item

I suspect it is not ideal, but I have found a workaround to this problem. In the pipelines.py file I have added more code that essentially reads the CSV file with the blank lines into a list, removes the blank lines, and then writes the cleaned list to a new file.

The code I added is:

with open('%s_items.csv' % spider.name, 'r') as f:
  reader = csv.reader(f)
  original_list = list(reader)
  cleaned_list = list(filter(None,original_list))

with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
    wr = csv.writer(output_file, dialect='excel')
    for data in cleaned_list:
      wr.writerow(data)

And so the entire pipelines.py file is:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import csv
from scrapy import signals
from scrapy.exporters import CsvItemExporter

class CSVPipeline(object):

  def __init__(self):
    self.files = {}

  @classmethod
  def from_crawler(cls, crawler):
    pipeline = cls()
    crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
    crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
    return pipeline

  def spider_opened(self, spider):
    file = open('%s_items.csv' % spider.name, 'w+b')
    self.files[spider] = file
    self.exporter = CsvItemExporter(file)
    self.exporter.fields_to_export = ["plotid","plotprice","plotname","name","address"]
    self.exporter.start_exporting()

  def spider_closed(self, spider):
    self.exporter.finish_exporting()
    file = self.files.pop(spider)
    file.close()

    # given I am using Windows I need to eliminate the blank lines in the csv file
    print("Starting csv blank line cleaning")
    with open('%s_items.csv' % spider.name, 'r') as f:
      reader = csv.reader(f)
      original_list = list(reader)
      cleaned_list = list(filter(None,original_list))

    with open('%s_items_cleaned.csv' % spider.name, 'w', newline='') as output_file:
        wr = csv.writer(output_file, dialect='excel')
        for data in cleaned_list:
          wr.writerow(data)

  def process_item(self, item, spider):
    self.exporter.export_item(item)
    return item


class CharleschurchPipeline(object):
    def process_item(self, item, spider):
        return item

Not ideal, but it solves the problem for now.

The b in w+b is most probably part of the problem: it opens the file in binary mode, so line breaks are written exactly as they are passed in, with no newline handling by the file object.

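So the first step would be to remove the b. Adding U does not help, though: the universal newlines flag (see: https://docs.python.org/3/glossary.html#term-universal-newlines) only applies to reading, open() raises ValueError when U is combined with w or +, and the flag was removed entirely in Python 3.11. For a file opened for writing, newline translation is controlled by the newline argument, and newline='' switches it off.

One caveat: Scrapy's documentation says the file passed to an exporter must have a write method that accepts bytes, so a text-mode file like this suits csv.writer rather than CsvItemExporter.

For a file written with csv.writer, the line in question would look like:

file = open('%s_items.csv' % spider.name, 'w+', newline='')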
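For comparison, here is a minimal sketch of a pipeline that skips CsvItemExporter and writes through csv.DictWriter to a text-mode file opened with newline='', the setting the csv module documentation recommends. PlainCsvPipeline and FIELDS are hypothetical names, the sketch assumes items convert cleanly with dict(item), and it has not been run against the spider above. The open_spider and close_spider hooks are the standard Scrapy pipeline methods, so no signal wiring is needed.

```python
import csv

# Column order taken from fields_to_export in the question.
FIELDS = ["plotid", "plotprice", "plotname", "name", "address"]


class PlainCsvPipeline:
    """Hypothetical pipeline: fixed column order, no newline translation."""

    def open_spider(self, spider):
        # newline='' stops Python from translating '\n', so the writer's own
        # '\r\n' row terminator reaches the file unchanged, even on Windows.
        self.file = open("%s_items.csv" % spider.name, "w",
                         newline="", encoding="utf-8")
        # extrasaction='ignore' drops item keys that are not listed in FIELDS.
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS,
                                     extrasaction="ignore")
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) is assumed to work for both Scrapy Items and plain dicts.
        self.writer.writerow(dict(item))
        return item
```

It would be enabled through ITEM_PIPELINES in the same way as CSVPipeline. The trade-off is that list-valued fields such as name and address are written as their Python string form, so they may need joining first.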
