
Pandas: slice one multiindex dataframe with multiindex of another when some levels don't match

I have two multiindexed dataframes, one with three levels and one with two. The first two levels match in both dataframes. I would like to find all rows of the first dataframe whose first two index levels match an index entry in the second dataframe. The second dataframe does not have a third level.

The closest answer I have found is this: How to slice one MultiIndex DataFrame with the MultiIndex of another. However, the setup there is slightly different and doesn't seem to translate to this case.

Consider the setup below:

import numpy as np
import pandas as pd

array_1 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']),
           np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])]

array_2 = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
           np.array(['one', 'two', 'three', 'one', 'two', 'two', 'one', 'two'])]

df_1 = pd.DataFrame(np.random.randn(8, 4), index=array_1).sort_index()

print(df_1)
                  0         1         2         3
bar one a  1.092651 -0.325324  1.200960 -0.790002
    two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066
    two a  0.315192 -1.548431 -0.214253 -1.790330
foo one b  1.022050 -2.791862  0.172165  0.924701
    two b  0.622062 -0.193056 -0.145019  0.763185
qux one b -1.241954 -1.270390  0.147623 -0.301092
    two b  0.778022  1.450522  0.683487 -0.950528

df_2 = pd.DataFrame(np.random.randn(8, 4), index=array_2).sort_index()

print(df_2)

                  0         1         2         3
bar one   -0.354889 -1.283470 -0.977933 -0.601868
    two   -0.849186 -2.455453  0.790439  1.134282
baz one   -0.143299  2.372440 -0.161744  0.919658
    three -1.008426 -0.116167 -0.268608  0.840669
foo two   -0.644028  0.447836 -0.576127 -0.891606
    two   -0.163497 -1.255801 -1.066442  0.624713
qux one   -1.545989 -0.422028 -0.489222 -0.357954
    two   -1.202655  0.736047 -1.084002  0.732150

Now I query the second dataframe, returning a subset of the original index:

df_2_selection = df_2[(df_2 > 1).any(axis=1)]
print(df_2_selection)

                0         1         2         3
bar two -0.849186 -2.455453  0.790439  1.134282
baz one -0.143299  2.372440 -0.161744  0.919658

I would like to find all the rows in df_1 that match the index entries found in df_2_selection. The first two levels line up, but df_2_selection has no third level.

This problem is easy when the indices line up; it would be solved by something like df_1.loc[df_2_selection.index], which works if both indexes have the same levels.

I can also find the rows that match on one of the levels with something like df_1[df_1.index.isin(df_2_selection.index.get_level_values(0), level=0)], but this does not solve the problem.

Chaining these statements together does not provide the desired functionality either:
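For reference, the matching-levels case can be sketched on a tiny frame. The names df_a, wanted and same_levels are hypothetical and the data is made up; this only illustrates that .loc accepts a MultiIndex whose levels are the same as the frame's.

```python
import numpy as np
import pandas as pd

# Hypothetical two-level frame; the values are arbitrary.
idx = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")]
)
df_a = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx)

# A selection index with the same two levels as df_a.
wanted = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])

# With matching levels, .loc can take the MultiIndex directly.
same_levels = df_a.loc[wanted]
print(same_levels)
```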

df_1[(df_1.index.isin(df_2_selection.index.get_level_values(0),level = 0)) & (df_1.index.isin(df_2_selection.index.get_level_values(1),level = 1))]
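A small sketch of why the chained masks over-select: each level is tested independently, so any combination of an accepted first-level value and an accepted second-level value passes, not only the pairs that occur together. The frame and the names df_small, selection_index and chained are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level frame.
idx = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"),
        ("bar", "two", "a"),
        ("baz", "one", "a"),
        ("baz", "two", "a"),
        ("foo", "one", "b"),
    ]
)
df_small = pd.DataFrame(np.arange(5), index=idx)

# Only these two (level 0, level 1) pairs are wanted.
selection_index = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])

# Per-level masks: level 0 in {bar, baz}, level 1 in {one, two}.
mask0 = df_small.index.isin(selection_index.get_level_values(0), level=0)
mask1 = df_small.index.isin(selection_index.get_level_values(1), level=1)

# The conjunction keeps every bar/baz row, including the unwanted
# ("bar", "one") and ("baz", "two").
chained = df_small[mask0 & mask1]
print(chained)
```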

I envision something along the lines of:

df_1_select = df_1[df_1.index.isin(
    df_2_selection.index.get_level_values([0, 1]), level=[0, 1])]  # Doesn't work

print(df_1_select)

                  0         1         2         3
bar two a -0.415263  1.006325 -0.077898  0.642134
baz one a -0.343707  0.474817  0.396702 -0.379066

I have tried many other methods, none of which worked exactly how I wanted. Thank you for your consideration.

EDIT:

This also does not work: df_1.loc[pd_idx[df_2_selection.index.get_level_values(0),df_2_selection.index.get_level_values(1),:],:]

I want only the rows where both levels match, not the rows where either level matches.

EDIT 2: This solution was posted by someone who has since deleted it:

id=[x+([x for x in df_1.index.levels[-1]]) for x in df_2_selection.index.values]

pd.concat([df_1.loc[x] for x in id])

This does indeed work! However, on large dataframes it is prohibitively slow. Any help with new methods or a speedup is greatly appreciated.

You can use reset_index() and merge().

With df_2_selection as:

                0         1         2         3
foo two -0.530151  0.932007 -1.255259  2.441294
qux one  2.006270  1.087412 -0.840916 -1.225508

Merge with:

lvls = ["level_0","level_1"]

(df_1.reset_index()
 .merge(df_2_selection.reset_index()[lvls], on=lvls)
 .set_index(["level_0","level_1","level_2"])
 .rename_axis([None]*3)
)

Output:

                  0         1         2         3
foo two b -0.112696  0.287421 -0.380692 -0.035471
qux one b  0.658227  0.632667 -0.193224  1.073132

Note: The rename_axis() part just removes the level names, e.g. level_0. It's purely cosmetic and is not needed for the actual matching.

Try this:
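One caveat worth sketching: in the question's setup df_2 contains the pair (foo, two) twice, and an inner merge repeats a left-hand row once per matching key. Assuming that repetition is not wanted, the key columns can be de-duplicated before merging. The frames df_left and df_sel and the names keys, naive and merged are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level frame to be filtered.
idx_1 = pd.MultiIndex.from_tuples(
    [("foo", "one", "b"), ("foo", "two", "b"), ("qux", "one", "b")]
)
df_left = pd.DataFrame(np.arange(6).reshape(3, 2), index=idx_1)

# Hypothetical two-level selection in which one key appears twice.
idx_2 = pd.MultiIndex.from_tuples([("foo", "two"), ("foo", "two")])
df_sel = pd.DataFrame(np.zeros((2, 2)), index=idx_2)

lvls = ["level_0", "level_1"]

# Merging on the raw keys repeats the matching row once per duplicate key.
naive = df_left.reset_index().merge(df_sel.reset_index()[lvls], on=lvls)

# De-duplicating the keys first keeps each matching row exactly once.
keys = df_sel.reset_index()[lvls].drop_duplicates()
merged = (
    df_left.reset_index()
    .merge(keys, on=lvls)
    .set_index(["level_0", "level_1", "level_2"])
    .rename_axis([None] * 3)
)
print(merged)
```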

pd.concat([
    df_1.xs(key, drop_level=False)
    for key in df_2_selection.index.values
])
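A possible loop-free alternative, not taken from the answers above: drop the third level from the three-level index and test the remaining (level 0, level 1) pairs against the selection index with isin. This is a minimal sketch on made-up data; df_three, selection_index, vectorized and looped are hypothetical names, and whether it is actually faster on a given dataset would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical three-level frame, sorted on its index.
idx = pd.MultiIndex.from_tuples(
    [
        ("bar", "one", "a"),
        ("bar", "two", "a"),
        ("bar", "two", "b"),
        ("baz", "one", "a"),
        ("baz", "two", "a"),
    ]
)
df_three = pd.DataFrame(np.arange(10).reshape(5, 2), index=idx).sort_index()

# The (level 0, level 1) pairs to keep.
selection_index = pd.MultiIndex.from_tuples([("bar", "two"), ("baz", "one")])

# Drop the third level, then test each remaining pair for membership.
first_two = df_three.index.droplevel(2)
mask = first_two.isin(selection_index)
vectorized = df_three[mask]

# The xs-based loop, for comparison on the same data.
looped = pd.concat(
    [df_three.xs(key, drop_level=False) for key in selection_index.values]
)
print(vectorized)
```

Unlike the loop, the mask keeps rows in df_three's own order and is not affected by duplicate keys in the selection.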
