
How to merge two csv files using multiprocessing with python pandas

I want to merge two CSV files on a common column using Python pandas. On a 32-bit processor it throws a memory error once memory use passes 2 GB. How can I do the same with multiprocessing or any other method?

import gc
import pandas as pd

# read both files in chunks, then stitch the chunks back into full frames
csv1_chunk = pd.read_csv('/home/subin/Desktop/a.txt', dtype=str, iterator=True, chunksize=1000)
csv1 = pd.concat(csv1_chunk, ignore_index=True)
csv2_chunk = pd.read_csv('/home/subin/Desktop/b.txt', dtype=str, iterator=True, chunksize=1000)
csv2 = pd.concat(csv2_chunk, ignore_index=True)
# keep rows of the first file whose PROFILE_MSISDN appears among the second file's L_MSISDN values
new_df = csv1[csv1["PROFILE_MSISDN"].isin(csv2["L_MSISDN"])]
new_df.to_csv("/home/subin/Desktop/apyb.txt", index=False)
gc.collect()

Please help me to fix this.

Thanks in advance.

I think you only need one column from your second file (actually, only unique elements from this column are needed), so there is no need to load the whole data frame.

import pandas as pd

csv2 = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'])
unique_msidns = set(csv2['L_MSISDN'])

If this still gives a memory error, try doing this in chunks:

# collect the unique L_MSISDN values without holding the whole file in memory
chunk_reader = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'], chunksize=1000)
unique_msidns = set()
for chunk in chunk_reader:
    unique_msidns.update(chunk['L_MSISDN'])

Now, we can deal with the first data frame.

chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    # *append* selected lines from every chunk to a file (mode='a')
    # col names are not written
    chunk[bool_idx].to_csv('output_file', header=False, index=False, mode='a')

If you need column names to be written into the output file, you can do it with the first chunk (I've skipped it for code clarity).
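For example, a minimal sketch of that header handling (the first_chunk flag here is illustrative, not part of the original answer):

import pandas as pd

chunk_reader = pd.read_csv('/home/subin/Desktop/a.txt', chunksize=1000)
first_chunk = True
for chunk in chunk_reader:
    bool_idx = chunk['PROFILE_MSISDN'].isin(unique_msidns)
    # write the header (and truncate any old output) only with the first chunk,
    # then append the remaining chunks without repeating the header
    chunk[bool_idx].to_csv('output_file',
                           header=first_chunk,
                           mode='w' if first_chunk else 'a',
                           index=False)
    first_chunk = False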

I believe it's safe (and probably faster) to increase chunksize.
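For instance (the 100000 here is just an illustrative value; tune it to the memory you have available):

chunk_reader = pd.read_csv('/home/subin/Desktop/b.txt', usecols=['L_MSISDN'], chunksize=100000)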

I didn't test this code, so be sure to double check it.
