Python 3.6: Compare two large gzipped csv files & fetch difference records
I have 2 gzipped csv files, IMFBOP2017_1.csv.gz and IMFBOP2017_2.csv.gz, with the same columns in both files, i.e. "Location, Indicator, Measure, Unit, Frequency, Date".
Total rows: 60 million+.
I want to compare both files and display the rows of IMFBOP2017_1 that are not present in IMFBOP2017_2.
My plan is to import both files into dataframes, add an extra column "compare" to both dataframes, and populate it by concatenating all the fields, like Location|Indicator|Measure|Unit|Frequency|Date, then do a NOT IN operation.
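The plan above could be sketched like this (using tiny in-memory DataFrames as illustrative stand-ins for the real files; the sample rows are hypothetical, not actual IMF data):

```python
import pandas as pd

cols = ["Location", "Indicator", "Measure", "Unit", "Frequency", "Date"]

# Stand-ins for IMFBOP2017_1 and IMFBOP2017_2 (made-up sample rows)
df1 = pd.DataFrame([["US", "BOP", "M1", "USD", "A", "2017"],
                    ["DE", "BOP", "M1", "EUR", "A", "2017"]], columns=cols)
df2 = pd.DataFrame([["US", "BOP", "M1", "USD", "A", "2017"]], columns=cols)

# Build the concatenated "compare" key: Location|Indicator|...|Date
df1["compare"] = df1[cols].astype(str).agg("|".join, axis=1)
df2["compare"] = df2[cols].astype(str).agg("|".join, axis=1)

# NOT IN: keep rows of df1 whose key does not appear in df2
diff = df1[~df1["compare"].isin(df2["compare"])].drop(columns="compare")
print(diff)
```

With the sample rows above, only the "DE" row survives the filter, since it has no match in the second frame.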
I think this is a costly process. Is there any simpler solution for this?
Pandas can read gzipped data files with the ordinary pandas.read_csv(). How to do a diff between two dataframes is described in Pandas: Diff of two Dataframes.