
Python 3.6: Compare two large gzipped CSV files & fetch difference records

I have two gzipped CSV files, IMFBOP2017_1.csv.gz and IMFBOP2017_2.csv.gz, with the same columns in both files, i.e. "Location, Indicator, Measure, Unit, Frequency, Date".

Total rows: 60 million+.

I want to compare the two files and display the rows of IMFBOP2017_1 that are not present in IMFBOP2017_2.

My plan is to import both files into dataframes, add an extra column "compare" to both dataframes, fill it by joining all fields, like

Location|Indicator|Measure|Unit|Frequency|Date

and then do a NOT IN operation.
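For illustration, the plan might look roughly like the sketch below. The helper names `add_compare_key` and `not_in` are hypothetical, and it assumes both files are already loaded into dataframes with the six columns listed above.

```python
import pandas as pd

COLUMNS = ["Location", "Indicator", "Measure", "Unit", "Frequency", "Date"]


def add_compare_key(df):
    """Return a copy of df with a 'compare' column joining all six fields with '|'."""
    out = df.copy()
    # Cast every field to text first so that "|".join works on each row.
    out["compare"] = out[COLUMNS].astype(str).agg("|".join, axis=1)
    return out


def not_in(df1, df2):
    """Rows of df1 whose compare key does not occur in df2 (the NOT IN step)."""
    left = add_compare_key(df1)
    right = add_compare_key(df2)
    mask = ~left["compare"].isin(right["compare"])
    return left.loc[mask, COLUMNS].reset_index(drop=True)
```

Building the key row by row is the expensive part here, since it creates one new string per row in each dataframe.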

I think this is a costly process. Is there a simpler solution?

Pandas can read gzipped data files with the ordinary pandas.read_csv(); by default the compression is inferred from the .gz extension. How to do a diff between two dataframes is described in "Pandas: Diff of two Dataframes".
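As a minimal sketch of one way to do this (not necessarily the method from the linked post), a left merge with `indicator=True` marks which rows of the first file have no match in the second. The function name `rows_only_in_first` is hypothetical, and the code assumes both files have a header row with exactly the six columns from the question.

```python
import pandas as pd

COLUMNS = ["Location", "Indicator", "Measure", "Unit", "Frequency", "Date"]


def rows_only_in_first(path_1, path_2):
    """Return the rows of the first CSV that have no exact match in the second."""
    # Compression is inferred from the .gz suffix. Reading everything as text
    # (dtype=str, no NaN conversion) keeps the comparison an exact string match.
    df1 = pd.read_csv(path_1, dtype=str, keep_default_na=False)
    df2 = pd.read_csv(path_2, dtype=str, keep_default_na=False)

    # Deduplicate the right side so a left row is never repeated by the merge.
    merged = df1.merge(df2.drop_duplicates(), on=COLUMNS, how="left", indicator=True)

    # "_merge" is "left_only" for rows found only in the first file.
    only_left = merged.loc[merged["_merge"] == "left_only", COLUMNS]
    return only_left.reset_index(drop=True)
```

Usage would be something like `rows_only_in_first("IMFBOP2017_1.csv.gz", "IMFBOP2017_2.csv.gz")`. Note that this loads both files fully into memory, which may be a problem with 60 million+ rows.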
