How to join two very large dataframes together with same columns?
I have two datasets that look like this:

`df1`:
| Date | City | State | Quantity |
| --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 |
| 2019-01 | Orlando | FL | 322 |
| ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 |
| 2021-07 | Orlando | FL | 4332 |
`df2`:
| Date | City | State | Sales |
| --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 30 |
| 2019-01 | Orlando | FL | 319 |
| ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 331 |
| 2021-07 | Orlando | FL | 4000 |
They are EXTREMELY large datasets, to the point where `pd.merge()` and `dd.merge()` do not work and my kernel gives me memory errors. However, I found that concatenating the two does not give me the memory error. My desired dataset, `out2`, looks like this:
| Date | City | State | Quantity | Sales |
| --- | --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 | 30 |
| 2019-01 | Orlando | FL | 322 | 319 |
| ... | ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 | 331 |
| 2021-07 | Orlando | FL | 4332 | 4000 |
I used the following code:

```python
out2 = dd.concat([df1, df2], join='outer')
```

but my new dataset looks like this:
| Date | City | State | Quantity | Sales |
| --- | --- | --- | --- | --- |
| 2019-01 | Chicago | IL | 35 | NaN |
| 2019-01 | Orlando | FL | 322 | NaN |
| 2019-01 | Chicago | IL | NaN | 30 |
| 2019-01 | Orlando | FL | NaN | 319 |
| ... | ... | ... | ... | ... |
| 2021-07 | Chicago | IL | 334 | NaN |
| 2021-07 | Orlando | FL | 4332 | NaN |
| 2021-07 | Chicago | IL | NaN | 331 |
| 2021-07 | Orlando | FL | NaN | 4000 |
How can I get my desired dataset without running into memory error issues, and without using the `pd.merge` function?
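For context, the NaN pattern above is expected: `concat` stacks rows and aligns columns by name; it never matches rows on key values. A minimal pandas sketch (tiny stand-in frames, not the real data) reproduces the behaviour, and also shows one way the stacked rows can be collapsed back, since `first()` takes the first non-NaN value per column within each group:

```python
import pandas as pd

# Tiny stand-ins for df1 and df2, just to reproduce the behaviour above.
df1 = pd.DataFrame({'Date': ['2019-01'], 'City': ['Chicago'],
                    'State': ['IL'], 'Quantity': [35]})
df2 = pd.DataFrame({'Date': ['2019-01'], 'City': ['Chicago'],
                    'State': ['IL'], 'Sales': [30]})

# concat stacks rows, so each input row keeps only its own value column
# and gets NaN in the other one -- the pattern shown in the table above.
stacked = pd.concat([df1, df2], join='outer')

# Collapsing the stacked rows: first() skips NaN within each
# (Date, City, State) group, leaving one row per key.
out2 = stacked.groupby(['Date', 'City', 'State'], as_index=False).first()
```

dask exposes the same `groupby(...).first()` aggregation, though whether that avoids the memory errors depends on the number of distinct keys, so treat this as a sketch rather than a guaranteed fix.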
If performance is not critical, you could create a `defaultdict` of `dict`, use the first three values of each row as the dict key, and then add Quantity and Sales to the value dict. This would allow you to process the files line by line without reading them into memory first.
```python
import pandas as pd
from collections import defaultdict
from pathlib import Path

paths = [(Path.home() / 'file1.csv', 'Quantity'), (Path.home() / 'file2.csv', 'Sales')]

results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row so it is not treated as data
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            key = tuple(parts[:-1])  # (Date, City, State)
            results[key][value_column] = parts[-1]

# Rebuild a single dataframe: key columns from the dict keys,
# Quantity/Sales from the per-key value dicts.
combined = pd.concat([pd.DataFrame(data=list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
```
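A quick way to sanity-check the approach on toy input (the file contents and the temp directory here are hypothetical stand-ins for the real CSVs):

```python
import tempfile
from collections import defaultdict
from pathlib import Path

import pandas as pd

# Write two tiny CSVs that mimic the structure of file1.csv / file2.csv.
tmp = Path(tempfile.mkdtemp())
(tmp / 'file1.csv').write_text('Date,City,State,Quantity\n2019-01,Chicago,IL,35\n')
(tmp / 'file2.csv').write_text('Date,City,State,Sales\n2019-01,Chicago,IL,30\n')

paths = [(tmp / 'file1.csv', 'Quantity'), (tmp / 'file2.csv', 'Sales')]
results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            results[tuple(parts[:-1])][value_column] = parts[-1]

combined = pd.concat([pd.DataFrame(list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
# combined has one row per (Date, City, State) with both value columns;
# note the values stay strings unless you convert them afterwards.
```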