
Join two huge files without chunking with pandas

I have File1 with "id,name" and File2 with "id,address". I cannot load the first file (less than 2 GB) into a DataFrame: the kernel crashes after 76k rows (with chunked concat), and it only has 2 columns... I cannot read_csv the second file either, because the kernel crashes after loading some rows.

I need to join File1 and File2 on "id", but if I cannot load the files into a DataFrame variable, I don't know how to do it...

The file is only 5 GB with 30M rows, but it crashes the kernel after a few seconds of loading.

How can I join the files without putting them in DataFrames?

I have tried chunking, but it crashes:

chunks = []
cols = [...]
for chunk in pd.read_csv("file2.csv", chunksize=500000, sep=',', error_bad_lines=False, low_memory=False, usecols=cols):
    chunks.append(chunk)
df = pd.concat(chunks, axis=0)
print(df.shape)

I need the DataFrames to load so I can join them, or a way to join the files without loading them, if possible.
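One way to join the files without loading either one into a DataFrame is a disk-backed join through the standard-library `sqlite3` module: SQLite keeps both tables on disk, so RAM usage stays small regardless of file size. A minimal sketch, with hypothetical file names and tiny demo data standing in for the real multi-GB inputs:

```python
import csv
import sqlite3

# Demo stand-ins for the real file1.csv / file2.csv (assumption:
# both have a header row and exactly the columns from the question).
with open("file1.csv", "w", newline="") as fh:
    fh.write("id,name\n1,Alice\n2,Bob\n3,Carol\n")
with open("file2.csv", "w", newline="") as fh:
    fh.write("id,address\n1,Paris\n3,Tokyo\n4,Oslo\n")

# The database lives on disk, so neither file has to fit in memory.
con = sqlite3.connect("join.db")
con.execute("DROP TABLE IF EXISTS f1")
con.execute("DROP TABLE IF EXISTS f2")
con.execute("CREATE TABLE f1 (id TEXT, name TEXT)")
con.execute("CREATE TABLE f2 (id TEXT, address TEXT)")

def load(path, table):
    """Stream a two-column CSV into a SQLite table row by row."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        next(reader)  # skip the header row
        con.executemany(f"INSERT INTO {table} VALUES (?, ?)", reader)
    con.commit()

load("file1.csv", "f1")
load("file2.csv", "f2")
con.execute("CREATE INDEX IF NOT EXISTS idx_f2 ON f2 (id)")  # speed up the join

# Stream the join result straight to an output CSV.
with open("joined.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "name", "address"])
    writer.writerows(
        con.execute(
            "SELECT f1.id, f1.name, f2.address FROM f1 JOIN f2 ON f1.id = f2.id"
        )
    )
con.close()
```

The inner join keeps only ids present in both files; switching the query to `LEFT JOIN` would keep every row of file1 instead.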

You read df2 chunk by chunk, but since you append all the chunks, the resulting DataFrame is the same size as your file2.

What you could do, if you are able to fully load your df1, is join df2 to it chunk by chunk, like so:

for chunk in pd.read_csv("file2.csv", chunksize=500000, sep=',', error_bad_lines=False, low_memory=False, usecols=cols):
    # merge returns a new DataFrame; keep it (the original snippet discarded it)
    merged = df1.merge(chunk, on=['id'], how='left')
    # ... write `merged` to disk or process it here

Chunking like that will definitely still crash your kernel, since you're still trying to fit everything into memory. You need to do something to your chunks to reduce their size.

For instance, you could read both files in chunks, join each chunk, output the matches to another file, and keep the un-matched IDs in memory. That might still crash your kernel if you get unlucky though. It depends on what your performance constraints are, and what you need to do with your data afterwards.
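The "output the matches to another file" part of this idea can be sketched as follows, assuming the smaller file1 (id,name) fits in memory while file2 is streamed chunk by chunk. File names and the tiny demo data are stand-ins for the real inputs:

```python
import pandas as pd

# Demo stand-ins for the real files (assumption: columns as in the question).
pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]}).to_csv(
    "file1.csv", index=False
)
pd.DataFrame({"id": [1, 3, 4], "address": ["Paris", "Tokyo", "Oslo"]}).to_csv(
    "file2.csv", index=False
)

df1 = pd.read_csv("file1.csv")  # assumed small enough to hold in RAM

first = True
for chunk in pd.read_csv("file2.csv", chunksize=2):
    # Inner join keeps only the matches; they go straight to disk, so
    # memory use stays bounded by one chunk plus df1.
    matched = df1.merge(chunk, on="id", how="inner")
    matched.to_csv("joined.csv", mode="w" if first else "a",
                   header=first, index=False)
    first = False
```

Rows of file2 whose id never appears in df1 (id 4 here) are silently dropped; tracking unmatched IDs, as the answer suggests, would take an extra set updated per chunk.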

Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost, please cite this site or the original source. For any questions, contact: yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM