Merge many sub-dataframes into a big dataframe in a loop in python pandas
My program will generate many small data frames such as the ones below:
Column_A Column_B
row1 1 2
Column_A Column_B
row2 3 4
Column_C Column_D
row1 5 6
Column_C Column_D
row2 7 8
I want them to be merged as
Column_A Column_B Column_C Column_D
row1 1 2 5 6
row2 3 4 7 8
How can this be done when the dataframes need to be merged one at a time? The code to generate the smaller data frames is:
df = {}
df[0] = pd.DataFrame({'Column_A': [1],
                      'Column_B': [2]},
                     index=["row1"])
df[1] = pd.DataFrame({'Column_A': [3],
                      'Column_B': [4]},
                     index=["row2"])
df[2] = pd.DataFrame({'Column_C': [5],
                      'Column_D': [6]},
                     index=["row1"])
df[3] = pd.DataFrame({'Column_C': [7],
                      'Column_D': [8]},
                     index=["row2"])
I tried using merge and concat, but they always end up creating more columns, either suffixing the existing column names with _x or _y or simply repeating the columns. For example, merging in the following way
pdf = pd.DataFrame()
for i in range(4):
    pdf = pdf.merge(pd.DataFrame(df[i], index=["row{}".format((i % 2) + 1)]),
                    how='outer', left_index=True, right_index=True)
produces
Column_A_x Column_B_x Column_A_y Column_B_y Column_C_x Column_D_x \
row1 1.0 2.0 NaN NaN 5.0 6.0
row2 NaN NaN 3.0 4.0 NaN NaN
Column_C_y Column_D_y
row1 NaN NaN
row2 7.0 8.0
Can someone help me with the correct way to merge these?
It would help you a lot if you can in any way keep the left and right parts in separate containers, e.g. columns A and B in one, columns C and D in the other. That way you could piece them together quite easily using pandas.concat. After the two halves have been built, you need to merge them, using the index in this case.
With your original df dictionary:
In [11]: pd.concat([df[0], df[1]]).merge(pd.concat([df[2], df[3]]), left_index=True, right_index=True)
Out[11]:
Column_A Column_B Column_C Column_D
row1 1 2 5 6
row2 3 4 7 8
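Since the join here is purely on the index, pandas.concat with axis=1 gives the same alignment as the merge; a minimal sketch, assuming the df dictionary from the question:

```python
import pandas as pd

# The df dictionary from the question.
df = {0: pd.DataFrame({'Column_A': [1], 'Column_B': [2]}, index=['row1']),
      1: pd.DataFrame({'Column_A': [3], 'Column_B': [4]}, index=['row2']),
      2: pd.DataFrame({'Column_C': [5], 'Column_D': [6]}, index=['row1']),
      3: pd.DataFrame({'Column_C': [7], 'Column_D': [8]}, index=['row2'])}

# Stack each half vertically, then align the two halves side by side
# on the index (concat with axis=1 performs an outer join on the index).
result = pd.concat([pd.concat([df[0], df[1]]),
                    pd.concat([df[2], df[3]])], axis=1)
print(result)
```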
With containers for the left and right halves the code reads a bit better (and there's no need for a loop):
left = [pd.DataFrame({'Column_A': [1],
                      'Column_B': [2]},
                     index=["row1"]),
        pd.DataFrame({'Column_A': [3],
                      'Column_B': [4]},
                     index=["row2"])]
right = [pd.DataFrame({'Column_C': [5],
                       'Column_D': [6]},
                      index=["row1"]),
         pd.DataFrame({'Column_C': [7],
                       'Column_D': [8]},
                      index=["row2"])]
df = pd.concat(left).merge(pd.concat(right), left_index=True, right_index=True)
Finally, if you truly have no option but to store them in a dictionary like in your example:
from functools import reduce, partial
from itertools import groupby

pdf = reduce(
    partial(pd.merge, left_index=True, right_index=True, how='outer'),
    (pd.concat(list(g))
     for cols, g in groupby(sorted(df.values(),
                                   key=lambda df_: tuple(df_.columns)),
                            lambda df_: tuple(df_.columns)))
)
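The reduce expression can also be unrolled into an explicit loop, which may be easier to follow: bucket the frames by their column signature, stack each bucket, and merge the stacks on the index. A sketch, assuming the same df dictionary:

```python
import pandas as pd

# The df dictionary from the question.
df = {0: pd.DataFrame({'Column_A': [1], 'Column_B': [2]}, index=['row1']),
      1: pd.DataFrame({'Column_A': [3], 'Column_B': [4]}, index=['row2']),
      2: pd.DataFrame({'Column_C': [5], 'Column_D': [6]}, index=['row1']),
      3: pd.DataFrame({'Column_C': [7], 'Column_D': [8]}, index=['row2'])}

# Bucket the frames by their column signature.
groups = {}
for frame in df.values():
    groups.setdefault(tuple(frame.columns), []).append(frame)

# Stack each bucket vertically, then merge the stacks on the index.
pdf = None
for frames in groups.values():
    stacked = pd.concat(frames)
    pdf = stacked if pdf is None else pdf.merge(
        stacked, left_index=True, right_index=True, how='outer')

print(pdf)
```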
Try this:
In [186]: result = pd.concat([df[key].reset_index() for key in df.keys()],
.....: ignore_index=True) \
.....: .set_index('index') \
.....: .groupby(level=0) \
.....: .sum() \
.....: .astype(int)
In [187]: result
Out[187]:
Column_A Column_B Column_C Column_D
index
row1 1 2 5 6
row2 3 4 7 8
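The .sum() step works because after the concat each cell holds a real value in exactly one of the stacked rows and NaN in the others, and sum skips NaN, so summing per index label just collapses the NaNs away. The same pipeline as a plain script, assuming the df dictionary from the question:

```python
import pandas as pd

# The df dictionary from the question.
df = {0: pd.DataFrame({'Column_A': [1], 'Column_B': [2]}, index=['row1']),
      1: pd.DataFrame({'Column_A': [3], 'Column_B': [4]}, index=['row2']),
      2: pd.DataFrame({'Column_C': [5], 'Column_D': [6]}, index=['row1']),
      3: pd.DataFrame({'Column_C': [7], 'Column_D': [8]}, index=['row2'])}

# Stack everything with the row labels as a regular column, regroup by
# label, and sum away the NaNs (each cell has exactly one real value).
result = (pd.concat([df[key].reset_index() for key in df], ignore_index=True)
            .set_index('index')
            .groupby(level=0)
            .sum()
            .astype(int))
print(result)
```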