
Updating Pandas MultiIndex after indexing the dataframe

Suppose I have the following dataframe:

import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656
baz   one    -2.616540 -1.368694
      two    -1.989319  1.627848
foo   one    -0.404563 -1.099314
      two    -2.006166  0.867398
qux   one    -0.843150 -1.045291
      two     2.129620 -2.697217

I know I can select a sub-dataframe by indexing:

temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp

                     0         1
first second                    
bar   one    -0.012581  1.421286
      two    -0.048482 -0.153656

However, if I look at the index, the values of the original index still appear:

temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0], [0, 1]],
       names=[u'first', u'second'])

This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I often do lots of filtering on big dataframes, and at the end I would like to know the index of what's left by just doing

df.index
df

This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?

To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:

 temp = s.loc[(('bar', 'foo'), slice(None)), :]

but the index still has 'baz' and 'qux' values:

MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
       labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
       names=[u'first', u'second'])

To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:

import pandas as pd
def update_multiindex(df):
    if isinstance(df.columns, pd.MultiIndex):
        new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
        new_df = pd.DataFrame(new_df)
    else:
        new_df = df.copy()
    if isinstance(df.index, pd.MultiIndex):
        new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
        new_df = pd.DataFrame(new_df).T
    return new_df

temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
       labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and with the original index.

That said, I suspect (but do not know) that MultiIndex used this way is built on top of CategoricalIndex, which has the method remove_unused_levels(). It may be wrapped by MultiIndex, but I cannot tell, because...

Second, MultiIndex is notably missing from the pandas API documentation. I do not use MultiIndex, but if you do use it regularly, you might consider looking for and/or opening a ticket on GitHub about this. Beyond that, you may have to dig through the source code if you want to find exact information on the features available with MultiIndex.
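For what it's worth, pandas 0.20 and later expose remove_unused_levels() directly on MultiIndex. Below is a minimal sketch of how it could be applied to the question's data, assuming such a pandas version; the variable names stale_levels and trimmed_levels are only illustrative. Keep in mind the caveat above: the trimmed index no longer shares its levels with the original.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])

# Keep only the 'bar' and 'foo' groups.
temp = s.loc[(['bar', 'foo'], slice(None)), :].copy()

# The filtered frame still carries all four first-level values.
stale_levels = list(temp.index.levels[0])

# remove_unused_levels() returns a new MultiIndex, so assign it back.
temp.index = temp.index.remove_unused_levels()
trimmed_levels = list(temp.index.levels[0])

print(stale_levels)
print(trimmed_levels)
```

The same call should work for multiindex columns via temp.columns = temp.columns.remove_unused_levels().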

If I understand your usage pattern correctly, you may be able to get the best of both worlds. I'm focusing on:

This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing

df.index
df

This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?

Consideration (1) is that you want to know the index of what's left. Consideration (2) is that, as mentioned above, if you trim the multiindex you can't merge any data back into your original, and it's also a bunch of nonobvious steps that aren't really encouraged.

The underlying point is that .index does NOT return updated contents for a MultiIndex when rows or columns have been removed. This is not considered a bug, because that's not the approved use of a MultiIndex (read more: github.com/pydata/pandas/issues/3686). The valid API access for the current contents of a MultiIndex is get_level_values.

So would it fit your needs to adjust your practice to use this?

df.index.get_level_values(-put your level name or number here-)

For MultiIndexes this is the approved API access technique, and there are some good reasons for it. If you use get_level_values instead of just .index, you'll be able to get the current contents while ALSO preserving all the information in case you want to re-merge modified data or otherwise match against the original indices for comparisons, grouping, etc.

Does that fit your needs?
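As a rough sketch of that approach on the question's data (the names first_values and remaining are just for illustration), get_level_values reports only the labels that are actually present, and unique() collapses them to the surviving keys:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])

# Filter without touching the index structure.
temp = s.loc[(['bar', 'foo'], slice(None)), :]

# One entry per remaining row, taken from the 'first' level.
first_values = temp.index.get_level_values('first')

# De-duplicate to see which keys survived the filter.
remaining = first_values.unique().tolist()
print(remaining)
```

The level can be given by name ('first') or by position (0).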

Try using droplevel.

temp.index = temp.index.droplevel()

>>> temp
               0         1
second                    
one     0.450819 -1.071271
two    -0.371563  0.411808

>>> temp.index
Index([u'one', u'two'], dtype='object')

When dealing with columns, it's the same thing:

df.columns = df.columns.droplevel()

You can also use xs with the drop_level parameter set to True (which is the default):

>>> s.xs('bar', drop_level=True) 
               0         1
second                    
one     0.450819 -1.071271
two    -0.371563  0.411808
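To make the effect of drop_level concrete, here is a small sketch comparing the two settings on the question's frame. It assumes a reasonably recent pandas; the names dropped and kept are illustrative only.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])

# drop_level=True (the default) removes the selected level,
# leaving a plain Index built from 'second'.
dropped = s.xs('bar')

# drop_level=False keeps the two-level MultiIndex.
kept = s.xs('bar', drop_level=False)

print(dropped.index)
print(kept.index.nlevels)
```

Note that droplevel and xs change the structure of the index, so they only fit when a single key of the outer level is selected.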

There is a difference between the index of s and the index of temp:

In [25]: s.index
Out[25]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [26]: temp.index
Out[26]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0], [0, 1]],
           names=[u'first', u'second'])

Notice that the labels in the MultiIndex are different.
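The same comparison can be made programmatically. This sketch assumes pandas 0.24 or later, where the labels attribute shown in the output above was renamed to codes:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
    names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
temp = s.loc[('bar', slice(None)), :].copy()

# The levels (the lookup table of possible values) are unchanged...
same_levels = temp.index.levels[0].equals(s.index.levels[0])

# ...but the codes (integer positions into the levels) cover only
# the rows that remain after filtering.
first_codes = temp.index.codes[0].tolist()
second_codes = temp.index.codes[1].tolist()

print(same_levels, first_codes, second_codes)
```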
