How to concatenate pandas dataframes with automatic keys?
Following on from an earlier question, I have:
import pandas as pd

df1 = pd.DataFrame(
    [
        {'a': 1},
        {'a': 2},
        {'a': 3},
    ]
)
df2 = pd.DataFrame(
    [
        {'a': 4},
        {'a': 5},
    ]
)
And I want:
df_id  a
1      1
       2
       3
2      4
       5
I accepted an answer too soon; it told me to do:
pd.concat([df1, df2], keys=[1,2])
which gives the correct result, but [1,2] is hardcoded.
I also want this to be incremental, meaning that given df3:
df_id  a
1      1
       2
       3
2      4
       5
and
df4 = pd.DataFrame(
    [
        {'a': 6},
        {'a': 7},
    ]
)
I want the concatenation to give:
df_id  a
1      1
       2
       3
2      4
       5
3      6
       7
using the same function.
How can I achieve this correctly?
EDIT: To relax the requirement, I can manage with only the incrementing function. It doesn't have to work with single-level dataframes, but it would be nice if it did.
IIUC,
import pandas as pd

def split_list_by_multitindex(l):
    # Separate frames that already carry a MultiIndex from plain ones.
    l_multi, l_not_multi = [], []
    for df in l:
        if isinstance(df.index, pd.MultiIndex):
            l_multi.append(df)
        else:
            l_not_multi.append(df)
    return l_multi, l_not_multi

def get_start_key(df):
    # Last label on the outer level (the highest key if df is ordered).
    return df.index.get_level_values(0)[-1]

def concat_starting_by_key(l, key):
    # Number the plain frames key, key + 1, ...
    return pd.concat(l, keys=range(key, key+len(l))) \
        if len(l) > 1 else set_multiindex_in_df(l[0], key)

def set_multiindex_in_df(df, key):
    # Wrap a single frame's index in a MultiIndex with `key` as outer label.
    return df.set_axis(pd.MultiIndex.from_product(([key], df.index)))

def myconcat(l):
    l_multi, l_not_multi = split_list_by_multitindex(l)
    # Multi-indexed frames keep their keys; plain frames continue after the last key.
    return pd.concat([*l_multi,
                      concat_starting_by_key(l_not_multi,
                                             get_start_key(l_multi[-1]) + 1)
                      ]) if l_multi else concat_starting_by_key(l_not_multi, 1)
Examples
l1 = [df1, df2]
print(myconcat(l1))
     a
1 0  1
  1  2
  2  3
2 0  4
  1  5
l2 = [myconcat(l1), df4]
print(myconcat(l2))
     a
1 0  1
  1  2
  2  3
2 0  4
  1  5
3 0  6
  1  7
myconcat([df4, myconcat([df1, df2]), df1, df2])
     a
1 0  1
  1  2
  2  3
2 0  4
  1  5
3 0  6
  1  7
4 0  1
  1  2
  2  3
5 0  4
  1  5
Note
This assumes that if we concatenate the dataframes belonging to the l_multi list, the resulting dataframe is already ordered.
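The ordering assumption matters because get_start_key reads the last outer-level label of the last multi-indexed frame. As a rough sketch of one way to avoid relying on that order, the next key could be taken from the largest outer label across all multi-indexed frames. The helper name next_key and its default argument are hypothetical and not part of the answer above.

```python
import pandas as pd

def next_key(frames, default=1):
    """Return one more than the largest outer-level key found in `frames`.

    Only frames with a MultiIndex contribute; `default` is returned
    when none of them has one. (Hypothetical helper, for illustration.)
    """
    outer_keys = [
        df.index.get_level_values(0).max()
        for df in frames
        if isinstance(df.index, pd.MultiIndex) and len(df) > 0
    ]
    return max(outer_keys) + 1 if outer_keys else default

# Outer keys deliberately out of order: 5 comes before 2.
a = pd.concat([pd.DataFrame({'a': [1, 2]})], keys=[5])
b = pd.concat([pd.DataFrame({'a': [3]})], keys=[2])

print(next_key([a, b]))                      # 6
print(next_key([pd.DataFrame({'a': [9]})]))  # 1
```

This only picks the next key; the rows of the combined frame would still appear in whatever order the inputs were passed.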
My approach was to nest two pd.concat calls, the second one to create a MultiIndex dataframe from a single-index one.
import pandas as pd

df = pd.DataFrame(
    [
        {'a': 1},
        {'a': 2},
        {'a': 3},
    ]
)
df2 = pd.DataFrame(
    [
        {'a': 4},
        {'a': 5},
    ]
)
df = pd.concat([df, df2], keys=df.index.get_level_values(0))
In[2]: df
Out[2]:
     a
0 0  1
  1  2
  2  3
1 0  4
  1  5
And to merge a new dataframe:
df3 = pd.DataFrame(
    [
        {'a': 6},
        {'a': 7},
    ]
)
In[3]: pd.concat([df, pd.concat([df3,], keys=(max(df.index.get_level_values(0))+1,))])
Out[3]:
     a
0 0  1
  1  2
  2  3
1 0  4
  1  5
2 0  6
  1  7
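The nested concat can be wrapped in a small function so the same call works for every new dataframe. This is a minimal sketch, assuming the combined frame already has a two-level MultiIndex with integer outer keys; the name append_with_next_key is hypothetical.

```python
import pandas as pd

def append_with_next_key(combined, new_df):
    """Append `new_df` to `combined` under the next outer-level key.

    Assumes `combined` already has a two-level MultiIndex whose outer
    level holds integer keys. (Hypothetical helper, for illustration.)
    """
    next_key = combined.index.get_level_values(0).max() + 1
    # The inner concat lifts new_df's single index to a MultiIndex.
    return pd.concat([combined, pd.concat([new_df], keys=[next_key])])

first = pd.DataFrame({'a': [1, 2, 3]})
second = pd.DataFrame({'a': [4, 5]})
third = pd.DataFrame({'a': [6, 7]})

combined = pd.concat([first, second], keys=[0, 1])
combined = append_with_next_key(combined, third)
print(combined.index.get_level_values(0).unique().tolist())  # [0, 1, 2]
```

Calling it repeatedly keeps incrementing the outer key, which is the incremental behaviour the question asks for.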
EDIT: Following the comment from ansev saying that this method was inefficient, I ran a simple test. This is the output:
In[5]: %timeit pd.concat([df, pd.concat([df3,], keys=(max(df.index.get_level_values(0))+1,))])
Out[5]: 1.99 ms ± 98.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compared to his method:
In[6]: %timeit [myconcat(l1), df3]
Out[6]: 1.92 ms ± 96.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is how I solved it:
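import pandas as pd

df1 = pd.DataFrame(
    [
        {'a': 1},
        {'a': 2},
        {'a': 3},
    ]
)
df2 = pd.DataFrame(
    [
        {'a': 4},
        {'a': 5},
    ]
)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
df = pd.concat([df1, df2])
# Each source frame starts at index 0, so flag those rows...
df['from'] = df.index == 0
# ...and count them cumulatively to get 1 for df1's rows, 2 for df2's rows.
df['from'] = df['from'].cumsum()
df = df[['from', 'a']]
print(df)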
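This leaves the dataframe id in an ordinary column rather than in the index. If the MultiIndex layout from the question is wanted, the column can be moved into the index afterwards. The following is only a sketch; it assumes every input frame has a default index starting at 0, and the level name df_id is taken from the question's desired output.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5]})

df = pd.concat([df1, df2])
# A new frame starts wherever the original index resets to 0.
df['from'] = (df.index == 0).cumsum()

# Move the counter into the index and make it the outer level.
result = df.set_index('from', append=True).swaplevel(0, 1)
result.index.names = ['df_id', None]
print(result)
```

If an input frame has a custom index, or contains the label 0 more than once, the counter would no longer match the frame boundaries.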