How to join two very large dataframes together with same columns?
I have two datasets that look like this:

`df1`:
| Date | City | State | Quantity |
| --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 |
| 2019-01 | Orlando | FL | 322 |
| ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 |
| 2021-07 | Orlando | FL | 4332 |
`df2`:
| Date | City | State | Sales |
| --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 30 |
| 2019-01 | Orlando | FL | 319 |
| ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 331 |
| 2021-07 | Orlando | FL | 4000 |
They are EXTREMELY large datasets, to the point where `pd.merge()` and `dd.merge()` do not work and my kernel gives me memory errors. However, I found that concatenating the two does not give me the memory error. My desired dataset, `out2`, looks like this:
| Date | City | State | Quantity | Sales |
| --- | --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 | 30 |
| 2019-01 | Orlando | FL | 322 | 319 |
| ... | ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 | 331 |
| 2021-07 | Orlando | FL | 4332 | 4000 |
I used the following code:

```python
out2 = dd.concat([df1, df2], join='outer')
```

but my new dataset looks like this:
| Date | City | State | Quantity | Sales |
| --- | --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 | NaN |
| 2019-01 | Orlando | FL | 322 | NaN |
| 2019-01 | Chicago | IL | NaN | 30 |
| 2019-01 | Orlando | FL | NaN | 319 |
| ... | ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 | NaN |
| 2021-07 | Orlando | FL | 4332 | NaN |
| 2021-07 | Chicago | IL | NaN | 331 |
| 2021-07 | Orlando | FL | NaN | 4000 |
How can I get my desired dataset without running into memory error issues, and without using the `pd.merge` function?
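For context, the NaN pattern above is expected: `concat` stacks rows and aligns columns by name; it never matches rows on key values. A minimal pandas sketch (tiny stand-in frames, not the real data) reproduces the behaviour, and also shows one way the stacked rows can be collapsed back, since `first()` takes the first non-NaN value per column within each group:

```python
import pandas as pd

# Tiny stand-ins for df1 and df2, just to reproduce the behaviour above.
df1 = pd.DataFrame({'Date': ['2019-01'], 'City': ['Chicago'],
                    'State': ['IL'], 'Quantity': [35]})
df2 = pd.DataFrame({'Date': ['2019-01'], 'City': ['Chicago'],
                    'State': ['IL'], 'Sales': [30]})

# concat stacks rows, so each input row keeps only its own value column
# and gets NaN in the other one -- the pattern shown in the table above.
stacked = pd.concat([df1, df2], join='outer')

# Collapsing the stacked rows: first() skips NaN within each
# (Date, City, State) group, leaving one row per key.
out2 = stacked.groupby(['Date', 'City', 'State'], as_index=False).first()
```

dask exposes the same `groupby(...).first()` aggregation, though whether that avoids the memory errors depends on the number of distinct keys, so treat this as a sketch rather than a guaranteed fix.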
If performance is not critical, you could create a `defaultdict` of `dict`, use the first three values of each row as the dict key, and then add Quantity and Sales to the value dict. This would allow you to process the files line by line without reading them into memory first.
```python
import pandas as pd
from collections import defaultdict
from pathlib import Path

paths = [(Path.home() / 'file1.csv', 'Quantity'), (Path.home() / 'file2.csv', 'Sales')]

results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row so it is not treated as data
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            key = tuple(parts[:-1])  # (Date, City, State)
            results[key][value_column] = parts[-1]

# Rebuild a single dataframe: key columns from the dict keys,
# Quantity/Sales from the per-key value dicts.
combined = pd.concat([pd.DataFrame(data=list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
```
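A quick way to sanity-check the approach on toy input (the file contents and the temp directory here are hypothetical stand-ins for the real CSVs):

```python
import tempfile
from collections import defaultdict
from pathlib import Path

import pandas as pd

# Write two tiny CSVs that mimic the structure of file1.csv / file2.csv.
tmp = Path(tempfile.mkdtemp())
(tmp / 'file1.csv').write_text('Date,City,State,Quantity\n2019-01,Chicago,IL,35\n')
(tmp / 'file2.csv').write_text('Date,City,State,Sales\n2019-01,Chicago,IL,30\n')

paths = [(tmp / 'file1.csv', 'Quantity'), (tmp / 'file2.csv', 'Sales')]
results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            results[tuple(parts[:-1])][value_column] = parts[-1]

combined = pd.concat([pd.DataFrame(list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
# combined has one row per (Date, City, State) with both value columns;
# note the values stay strings unless you convert them afterwards.
```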