Pandas: slice one multiindex dataframe with multiindex of another when some levels don't match
I have two multiindexed dataframes, one with three levels (df_1) and one with two (df_2). The first two levels match in both dataframes. I would like to find all values from the first dataframe where the first two index levels match in the second dataframe. The second dataframe does not have a third level.
The closest answer I have found is "How to slice one MultiIndex DataFrame with the MultiIndex of another"; however, the setup there is slightly different and doesn't seem to translate to this case.
Consider the setup below:
import numpy as np
import pandas as pd

array_1 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
           np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])]
array_2 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'three', 'one', 'two', 'two', 'one', 'two'])]
df_1 = pd.DataFrame(np.random.randn(8, 4), index=array_1).sort_index()
print(df_1)
                  0         1         2         3
bar one a  1.092651 -0.325324  1.200960 -0.790002
    two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
    two a  0.315192 -1.548431 -0.214253 -1.790330
foo one b  1.022050 -2.791862  0.172165  0.924701
    two b  0.622062 -0.193056 -0.145019  0.763185
qux one b -1.241954 -1.270390  0.147623 -0.301092
    two b  0.778022  1.450522  0.683487 -0.950528
df_2 = pd.DataFrame(np.random.randn(8, 4), index=array_2).sort_index()
print(df_2)
                  0         1         2         3
bar one   -0.354889 -1.283470 -0.977933 -0.601868
    two   -0.849186 -2.455453  0.790439  1.134282
baz one   -0.143299  2.372440 -0.161744  0.919658
    three -1.008426 -0.116167 -0.268608  0.840669
foo two   -0.644028  0.447836 -0.576127 -0.891606
    two   -0.163497 -1.255801 -1.066442  0.624713
qux one   -1.545989 -0.422028 -0.489222 -0.357954
    two   -1.202655  0.736047 -1.084002  0.732150
Now I query the second dataframe, returning a subset of the original indexes:
df_2_selection = df_2[(df_2 > 1).any(axis=1)]
print(df_2_selection)
                0         1         2         3
bar two -0.849186 -2.455453  0.790439  1.134282
baz one -0.143299  2.372440 -0.161744  0.919658
I would like to find all the values in df_1 that match the indices selected from df_2. The first two levels line up, but the third does not.
This problem is easy when the indices line up, and would be solved by something like:
df_1.loc[df_2_selection.index]  # this works if indexes are the same
I can also find the values which match one of the levels with something like:
df_1[df_1.index.isin(df_2_selection.index.get_level_values(0), level=0)]
but this does not solve the problem.
Chaining these statements together does not provide the desired functionality:
df_1[(df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)) & (df_1.index.isin(df_2_selection.index.get_level_values(1),level = 1))]
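To see why the chained masks fall short, here is a minimal sketch with a deterministic stand-in for df_1 (the values are arbitrary placeholders, and `selected` is a hypothetical name for the two pairs picked from df_2). Each mask tests one level independently, so the combined mask accepts every combination of the selected level-0 and level-1 labels rather than only the selected pairs.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for df_1 from the question (values are arbitrary).
index_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=index_1)

# Hypothetical name: the two (level 0, level 1) pairs selected from df_2.
selected = pd.MultiIndex.from_tuples([('bar', 'two'), ('baz', 'one')])

# Each mask looks at a single level in isolation.
mask_0 = df_1.index.isin(selected.get_level_values(0), level=0)  # bar or baz
mask_1 = df_1.index.isin(selected.get_level_values(1), level=1)  # one or two

chained = df_1[mask_0 & mask_1]
print(chained.index.tolist())
```

Four rows come back, including ('bar', 'one', 'a') and ('baz', 'two', 'a'), although only the pairs ('bar', 'two') and ('baz', 'one') were selected.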
I envision something along the lines of:
df_1_select = df_1[df_1.index.isin(
    df_2_selection.index.get_level_values([0, 1]), level=[0, 1])]  # Doesn't work
print(df_1_select)
                  0         1         2         3
bar two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
I have tried many other methods, none of which worked exactly how I wanted. Thank you for your consideration.
EDIT:
This
df_1.loc[pd_idx[df_2_selection.index.get_level_values(0), df_2_selection.index.get_level_values(1), :], :]
also does not work. I want only the rows where both levels match, not the rows where either level matches.
EDIT 2: This solution was posted by someone who has since deleted it:
id=[x+([x for x in df_1.index.levels[-1]]) for x in df_2_selection.index.values]
pd.concat([df_1.loc[x] for x in id])
This does indeed work! However, on large dataframes it is prohibitively slow. Any help with new methods or a speedup is greatly appreciated.
You can use reset_index() and merge().
With df_2_selection as:
                0         1         2         3
foo two -0.530151  0.932007 -1.255259  2.441294
qux one  2.006270  1.087412 -0.840916 -1.225508
Merge with:
lvls = ["level_0","level_1"]
(df_1.reset_index()
.merge(df_2_selection.reset_index()[lvls], on=lvls)
.set_index(["level_0","level_1","level_2"])
.rename_axis([None]*3)
)
Output:
                  0         1         2         3
foo two b -0.112696  0.287421 -0.380692 -0.035471
qux one b  0.658227  0.632667 -0.193224  1.073132
Note: The rename_axis() part just removes the level names, e.g. level_0. It's purely cosmetic, and not necessary to perform the actual matching procedure.
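A possible alternative that avoids the round trip through columns, sketched here as an assumption rather than as part of the answer above: drop the third level of df_1's index and test the remaining (level 0, level 1) pairs against df_2_selection.index with MultiIndex.isin. The data is a deterministic stand-in for the question's frames, and `pairs` and `mask` are illustrative names.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the question's frames (values are arbitrary).
index_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=index_1)

index_sel = pd.MultiIndex.from_tuples([('foo', 'two'), ('qux', 'one')])
df_2_selection = pd.DataFrame(np.zeros((2, 4)), index=index_sel)

# Two-level view of df_1's index: one (level 0, level 1) pair per row.
pairs = df_1.index.droplevel(2)

# True only where the whole pair appears in df_2_selection's index.
mask = pairs.isin(df_2_selection.index)

result = df_1[mask]
print(result)
```

Because the pairs are compared as tuples, a row is kept only when both levels match together, and the original three-level index of df_1 is preserved without any renaming step.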
Try this:
pd.concat([
df_1.xs(key, drop_level=False)
for key in df_2_selection.index.values])
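A self-contained sketch of the same idea on deterministic stand-in data (the values are arbitrary). One assumption added here that is not in the snippet above: the keys are de-duplicated with `.unique()` first, since a pair that appears twice in df_2_selection (as ('foo', 'two') does in the question's df_2) would otherwise be concatenated twice. Note also that xs raises a KeyError for a pair that is absent from df_1.

```python
import numpy as np
import pandas as pd

# Deterministic stand-ins for the question's frames (values are arbitrary).
index_1 = pd.MultiIndex.from_arrays([
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
    ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
])
df_1 = pd.DataFrame(np.arange(32).reshape(8, 4), index=index_1)

# The selection deliberately contains ('foo', 'two') twice.
index_sel = pd.MultiIndex.from_tuples(
    [('foo', 'two'), ('foo', 'two'), ('qux', 'one')])
df_2_selection = pd.DataFrame(np.zeros((3, 4)), index=index_sel)

# Iterating over a MultiIndex yields one tuple per entry.
keys = df_2_selection.index.unique()

# drop_level=False keeps all three index levels on each cross-section.
result = pd.concat([df_1.xs(key, drop_level=False) for key in keys])
print(result)
```

This still performs one lookup per key in a Python loop, so it may remain slow when the selection is large.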