

Merging 2 largely unequal files using pandas or awk

I have 2 CSV files that need to be merged. File1 is around 20 GB and the other has only ~1000 lines. Because of the large size, I first iterate over the bigger file: I want to filter the bigger file (say file1) down to a smaller file and then load/merge it using pandas.

File1: the bigger file is as below

col1,col2,col3
1,1,1491795901021327897
1,2,1491795901021342873
1,3,1491795901021347247
1,4,1491795901021351620
1,5,1491795901021356612
1,6,1491795901021361172
1,7,1491795901021366797

The smaller file is as below

col1,col2,col3,col4,col5,col6
val1,val2,val3,1,6,1412414141412414
val1,val2,val3,1,3,1434252352352325

One way I did this was to create a single key from both files by computing 10*10**10*(value at col1) + value at col2, and similarly in the smaller file using col4 and col5. I save these key values as a list and, for each line in the bigger file, print the row if its key is present in the list, so finally a small filtered file is printed. Is there a better way to do this in Python, or maybe using awk?
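For reference, a minimal sketch of that filtering approach, assuming the big file is file1.csv and the small one is file2.csv (both names are placeholders); it keeps (col1, col2) / (col4, col5) tuples in a set instead of building the arithmetic key:

import csv

# Collect the (col4, col5) pairs from the small file.
keys = set()
with open("file2.csv", newline="") as small:
    reader = csv.reader(small)
    next(reader)                      # skip the header
    for row in reader:
        keys.add((row[3], row[4]))    # col4, col5

# Stream the big file once, keeping only rows whose (col1, col2) matches.
with open("file1.csv", newline="") as big, \
     open("filtered.csv", "w", newline="") as out:
    reader = csv.reader(big)
    writer = csv.writer(out)
    writer.writerow(next(reader))     # copy the header
    for row in reader:
        if (row[0], row[1]) in keys:
            writer.writerow(row)

A set gives constant-time membership tests, which matters when the check runs once per line of the 20 GB file.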

The ultimate intent is to merge, but since 20 GB cannot be loaded into pandas, I'm filtering the file to make it smaller first. I'm sure there must be a better way to approach this.
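(As a side note, pandas can also avoid loading the 20 GB file at once by reading it in chunks and merging each chunk against the small file. The sketch below is only an illustration under the same assumed file names, and the chunk size is arbitrary.)

import pandas as pd

small = pd.read_csv("file2.csv")

pieces = []
# Read the big file a million rows at a time instead of all at once;
# each inner merge keeps only the rows that match the ~1000 small-file keys.
for chunk in pd.read_csv("file1.csv", chunksize=1_000_000):
    pieces.append(chunk.merge(small,
                              left_on=["col1", "col2"],
                              right_on=["col4", "col5"]))

merged = pd.concat(pieces, ignore_index=True)
merged.to_csv("merged.csv", index=False)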

awk to the rescue!

by extrapolation, I think that's what you want

$ awk 'BEGIN        {FS=OFS=","} 
       NR==1        {h=$0; next} 
       NR==FNR      {a[$4,$5]=$0; next} 
       FNR==1       {print h,$3} 
       ($1,$2) in a {print a[$1,$2],$3}' small large

col1,col2,col3,col4,col5,col6,col3
val1,val2,val3,1,3,1434252352352325,1491795901021347247
val1,val2,val3,1,6,1412414141412414,1491795901021361172

It should be easy to read, but I can write an explanation if I get feedback on my interpretation.

Try this -

$ head f?
==> f1 <==
col1,col2,col3
1,1,1491795901021327897
1,2,1491795901021342873
1,3,1491795901021347247
1,4,1491795901021351620
1,5,1491795901021356612
1,6,1491795901021361172
1,7,1491795901021366797

==> f2 <==
col1,col2,col3,col4,col5,col6
val1,val2,val3,1,6,1412414141412414
val1,val2,val3,1,3,1434252352352325
$ awk -F, 'NR==FNR{a[$4 FS $5]=$6;next} ($1 FS $2) in a {print $0 FS a[$1 FS $2]}' f2 f1
1,3,1491795901021347247,1434252352352325
1,6,1491795901021361172,1412414141412414

Explained -

A key is created using $4 FS $5 from file f2 and matched against the key $1 FS $2 of file f1; if $4 FS $5 of f2 matches $1 FS $2, then the whole line from file f1 is printed along with $6 from file f2.
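For anyone who prefers the Python side of the question, the same keyed lookup can be sketched roughly like this (file names f1 and f2 follow the example above; this is an illustration, not the answerer's code):

import csv

# Map (col4, col5) of f2 to its col6, mirroring a[$4 FS $5] = $6.
lookup = {}
with open("f2", newline="") as f2:
    for row in csv.reader(f2):
        lookup[(row[3], row[4])] = row[5]

# For each line of f1, print it plus the matching col6, mirroring
# ($1 FS $2) in a { print $0 FS a[$1 FS $2] }.
with open("f1", newline="") as f1:
    for row in csv.reader(f1):
        key = (row[0], row[1])
        if key in lookup:
            print(",".join(row + [lookup[key]]))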
