How to write a table encoded as a list of dictionaries directly to a zipped archive containing a CSV?

Suppose you have data in the form of a list of dictionaries, like d here:

d = [{'a' : 1, 'b' : 2}, {'a' : 3, 'c' : 5}]

and you want to save it as a comma-separated table in a zipped CSV (a .zip archive, not a gzipped file) without going through, e.g., pandas.DataFrame.from_dict().

Why not via pandas? Because in practice d may correspond to a very large and, above all, sparse DataFrame, i.e. a table with many more columns than non-NA values per row, which for some reason occupies a huge amount of memory. (This is not hypothetical: it made our scripts crash several times, hence our need to work around it.)

d is a sort of unpivoted-in-disguise version of the data, because each dictionary contains only the relevant data, not a useless sequence of NAs.

From the csv module's documentation I learned how to write d directly to a CSV:

import csv

# newline='' lets the csv module control line endings itself
with open('test.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows(d)

but I don't see any option to write to a zipped CSV.

I consulted the documentation of zipfile, but I could not make it work, due to the usual mismatch between text and bytes.

if os.path.exists('test.csv.zip') :
    os.remove('test.csv.zip')
with zipfile.ZipFile('test.csv.zip', mode = 'a') as zip :
    with zip.open('test.csv', 'w') as csvfile :
        writer = csv.DictWriter(csvfile, fieldnames = ['a','b','c'])
        writer.writeheader()
        writer.writerows(d)

# TypeError: a bytes-like object is required, not 'str'

Can anyone think of a workaround, or maybe a radically different approach that I am not seeing?

The fundamental constraints are:

  1. d is always going to be generated: this we cannot decide or change.
  2. Avoid generating very large objects that consume as much memory or disk space as the dense pandas.DataFrame.from_dict().
  3. The data must be written to a csv.zip archive.

Otherwise we would write to a CSV, hoping that it is not too huge (but yeah, that was the initial issue, so...), and zip it afterwards.


EDIT: posting the implementation from Daweo's answer, for completeness.

import os
import zipfile
import csv
import codecs
utf8 = codecs.getwriter('utf_8') # or other encoding dictated by requirements

output_zip_file = 'test.csv.zip'

if os.path.exists(output_zip_file) :
    os.remove(output_zip_file)
with zipfile.ZipFile(output_zip_file, mode = 'a') as zip :
    with zip.open('out.csv', 'w') as csvfile :
        writer = csv.DictWriter(utf8(csvfile), fieldnames = ['a','b','c'])
        writer.writeheader()
        writer.writerows(d)

You might use codecs.StreamWriter if you want to use csv.DictWriter with a binary file handle. Consider the following simple example:

import csv
import codecs
utf8 = codecs.getwriter('utf_8') # or other encoding dictated by requirements
with open("file.csv","wb") as f:
    writer = csv.DictWriter(utf8(f), fieldnames = ['a','b','c'])
    writer.writeheader()
    writer.writerows([{'a':1},{'b':2},{'c':3}])

This creates file.csv containing:

a,b,c
1,,
,2,
,,3
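
The same idea can probably also be expressed with io.TextIOWrapper instead of codecs.StreamWriter, wrapping the binary handle that ZipFile.open returns. What follows is only a minimal sketch, not a tested production recipe. The helper name write_zipped_csv, the choice of ZIP_DEFLATED, and deriving the column list from the union of the keys in d are all assumptions made for illustration and are not part of the question.

```python
import csv
import io
import zipfile

d = [{'a': 1, 'b': 2}, {'a': 3, 'c': 5}]

# Assumption: the columns are the union of all keys, in first-seen order.
fieldnames = list(dict.fromkeys(key for row in d for key in row))


def write_zipped_csv(rows, fieldnames, zip_path, member_name):
    """Hypothetical helper: stream rows into a CSV member of a new .zip archive."""
    with zipfile.ZipFile(zip_path, mode='w',
                         compression=zipfile.ZIP_DEFLATED) as archive:
        # ZipFile.open(..., mode='w') yields a binary, write-only handle.
        with archive.open(member_name, mode='w') as raw:
            # TextIOWrapper encodes str to bytes on the way through;
            # newline='' leaves the csv module's line endings untouched.
            with io.TextIOWrapper(raw, encoding='utf-8', newline='') as text:
                writer = csv.DictWriter(text, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)


write_zipped_csv(d, fieldnames, 'test.csv.zip', 'test.csv')
```

Rows are encoded and compressed as they are written, so neither a dense table nor an uncompressed CSV file has to exist at any point. If the uncompressed CSV might exceed 2 GiB, passing force_zip64=True to archive.open is likely needed.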
