How to write a table encoded as a list of dictionaries directly to a zipped archive containing a CSV?
Suppose you have data in the form of a list of dictionaries like d here:
d = [{'a' : 1, 'b' : 2}, {'a' : 3, 'c' : 5}]
and you want to save it as a comma-separated table to a zipped CSV (a .zip archive, not gzip) without going through, e.g., pandas.DataFrame.from_dict().
Why not via pandas? Because d in real practice may correspond to a very large, but above all sparse, DataFrame, i.e. a table with many more columns than non-NA values per row, which for some reason occupies a huge amount of memory (BTW this is not theoretical: it made our scripts crash several times, hence our need to work around it).
d is a sort of unpivoted-in-disguise version of the data, because each dictionary only contains the relevant data, not a useless sequence of NAs.
From the csv module's documentation I learned how to write d directly to a CSV:
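import csv

# newline='' is what the csv documentation recommends for files passed to a writer
with open('test.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows(d)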
but I don't see any option to write to a zipped CSV.
I consulted the documentation of zipfile, but I could not make it work, because of the usual mismatch between text and bytes.
if os.path.exists('test.csv.zip'):
    os.remove('test.csv.zip')
with zipfile.ZipFile('test.csv.zip', mode='a') as zip:
    with zip.open('test.csv', 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=['a', 'b', 'c'])
        writer.writeheader()
        writer.writerows(d)
# TypeError: a bytes-like object is required, not 'str'
Can anyone think of a workaround, or maybe a radically different approach that I am not seeing?
The fundamental constraints are:
d is always going to be generated: this is something we cannot decide or change.
Avoid generating very large objects that consume as much memory or disk space as the dense result of pandas.DataFrame.from_dict().
Otherwise we would write to a CSV, hoping that it is not too huge (but yeah, that was the initial issue, so...), and zip it afterwards.
EDIT: posting the implementation from Daweo's answer, for completeness.
import os
import zipfile
import csv
import codecs

utf8 = codecs.getwriter('utf_8')  # or other encoding dictated by requirements
output_zip_file = 'test.csv.zip'
if os.path.exists(output_zip_file):
    os.remove(output_zip_file)
with zipfile.ZipFile(output_zip_file, mode='a') as zip:
    with zip.open('out.csv', 'w') as csvfile:
        writer = csv.DictWriter(utf8(csvfile), fieldnames=['a', 'b', 'c'])
        writer.writeheader()
        writer.writerows(d)
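Two details of the implementation above may be worth adjusting: the column names are hard-coded, and zipfile stores members uncompressed (ZIP_STORED) unless told otherwise. Here is a minimal sketch of one way to handle both, assuming the rows fit in memory as a list (as d does) and that first-seen key order is an acceptable column order. The helper name write_dicts_to_zip and its parameters are made up for illustration, not part of any library.

```python
import codecs
import csv
import zipfile


def write_dicts_to_zip(rows, zip_path, member_name="out.csv", encoding="utf_8"):
    """Write a list of dicts as one CSV member of a new .zip archive.

    Hypothetical helper for illustration; returns the column names used.
    """
    # Union of all keys in first-seen order (dicts preserve insertion order).
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    make_writer = codecs.getwriter(encoding)
    # mode="w" replaces any existing archive, so no os.remove() is needed.
    # ZIP_DEFLATED requests actual compression; the default is ZIP_STORED.
    with zipfile.ZipFile(zip_path, mode="w",
                         compression=zipfile.ZIP_DEFLATED) as archive:
        with archive.open(member_name, "w") as raw:
            writer = csv.DictWriter(make_writer(raw), fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)  # keys missing from a row become empty cells
    return fieldnames
```

Calling write_dicts_to_zip(d, 'test.csv.zip') with the d from the top of the question should produce an archive with a single member out.csv whose columns are a, b, c. The rows are streamed into the archive one at a time, so no dense table is ever built in memory.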
You might use codecs.StreamWriter if you want to use csv.DictWriter with a binary file handle. Consider the following simple example:
import csv
import codecs

utf8 = codecs.getwriter('utf_8')  # or other encoding dictated by requirements
with open("file.csv", "wb") as f:
    writer = csv.DictWriter(utf8(f), fieldnames=['a', 'b', 'c'])
    writer.writeheader()
    writer.writerows([{'a': 1}, {'b': 2}, {'c': 3}])
This creates file.csv containing:
a,b,c
1,,
,2,
,,3
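As an alternative to codecs.getwriter (not part of the answer above, just another standard-library option), io.TextIOWrapper can also turn the binary handle returned by ZipFile.open into a text handle that csv.DictWriter accepts. The sketch below writes the same three rows into a zip member and reads them back; the archive is built in an io.BytesIO buffer only to keep the example self-contained, and the variable names are illustrative.

```python
import csv
import io
import zipfile

rows = [{'a': 1}, {'b': 2}, {'c': 3}]

# In-memory archive for the sake of the example; a file path works the same way.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode="w") as archive:
    with archive.open("file.csv", "w") as raw:
        # newline="" leaves line endings under the csv module's control.
        with io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
            writer = csv.DictWriter(text, fieldnames=['a', 'b', 'c'])
            writer.writeheader()
            writer.writerows(rows)

# Read the member back to check what ended up in the archive.
with zipfile.ZipFile(buffer) as archive:
    with archive.open("file.csv") as raw, \
            io.TextIOWrapper(raw, encoding="utf-8", newline="") as text:
        read_back = list(csv.DictReader(text))
```

Here read_back is a list of three dictionaries with string values, where the cells missing from the input come back as empty strings.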