Subset a pandas dataframe that has an index that contains duplicates

For the data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4, 5, np.nan, np.nan],
    'value': ['one', 'two', 'three', 'four', 'five', 'six', 'seven']
}).set_index('key')

That looks like this:

        value
key     
1.0     one
2.0     two
3.0     three
4.0     four
5.0     five
NaN     six
NaN     seven

I would like to subset it to:

    value
key     
1   one
1   one
6   NaN

This produces a warning:

df.loc[[1,1,6],]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

This produces an error:

df.reindex([1, 1, 6])

ValueError: cannot reindex from a duplicate axis
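
To see where the error comes from, the following minimal sketch rebuilds the frame from the question and checks the index before calling reindex. The variable names (index_is_unique, nan_label_count, reindex_failed) are only illustrative, and the exact wording of the error message may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4, 5, np.nan, np.nan],
    'value': ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
}).set_index('key')

# The two NaN labels count as duplicates, so the index is not unique.
index_is_unique = df.index.is_unique
nan_label_count = int(df.index.isna().sum())
print(index_is_unique, nan_label_count)

# reindex refuses to work on an index with duplicate labels.
try:
    df.reindex([1, 1, 6])
    reindex_failed = False
except ValueError as exc:
    reindex_failed = True
    print(exc)
```

Note that the duplicates in the requested labels ([1, 1, 6]) are not the problem; the duplicates in the existing index are.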

How can I do this while referencing a missing index label, and without using apply?

The problem is that the index contains duplicated values: the NaNs. Exclude them before reindexing, because with duplicate labels it is ambiguous which row's value should be used in the new index.

df.loc[df.index.dropna()].reindex([1, 1, 6])

    value
key 
1   one
1   one
6   NaN
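
As a self-contained check of the approach above, this sketch rebuilds the frame from the question, drops the NaN labels, and then reindexes. The boolean-mask variant with notna() is an alternative spelling added here for comparison; it is not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4, 5, np.nan, np.nan],
    'value': ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
}).set_index('key')

# Keep only the rows whose label is not NaN; the remaining index is unique.
clean = df.loc[df.index.dropna()]

# Alternative spelling (an assumption of this sketch): filter with a boolean mask.
clean_via_mask = df[df.index.notna()]
same_rows = clean.equals(clean_via_mask)

# Repeated and missing labels are both fine once the source index is unique.
result = clean.reindex([1, 1, 6])
print(result)
```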

For a more general solution, use duplicated to drop every row whose index label is repeated:

df.loc[~df.index.duplicated(keep=False)].reindex([1, 1, 6])
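
One consequence of keep=False is worth seeing on a frame whose duplicated label is not NaN. The sketch below uses a small hypothetical frame (df2, not from the question): with keep=False every row that shares a label is discarded, so that label comes back as missing after reindex, whereas keep='first' retains one row per label.

```python
import pandas as pd

# Hypothetical frame in which label 2 appears twice.
df2 = pd.DataFrame(
    {'value': ['a', 'b', 'c', 'd']},
    index=pd.Index([1, 2, 2, 3], name='key'),
)

# keep=False marks every occurrence of a repeated label, so both rows for 2 go away.
dropped_all = df2.loc[~df2.index.duplicated(keep=False)].reindex([1, 2, 6])

# keep='first' marks only the later occurrences, so the first row for 2 survives.
kept_first = df2.loc[~df2.index.duplicated(keep='first')].reindex([1, 2, 6])

print(dropped_all)
print(kept_first)
```

Which variant is appropriate depends on whether an arbitrary one of the duplicated rows is an acceptable answer for that label.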

If you want to keep the duplicated index labels and still use reindex, it will fail. This has been asked a couple of times before.
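
If the duplicated labels have to stay, one possible workaround is to avoid reindex altogether and use a left merge, which does not require unique keys. This is not part of the original answer; it is a sketch, and the helper frame named wanted is introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'key': [1, 2, 3, 4, 5, np.nan, np.nan],
    'value': ['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
}).set_index('key')

# The labels to look up, as an ordinary column (floats, to match the key dtype).
wanted = pd.DataFrame({'key': [1.0, 1.0, 6.0]})

# A left merge keeps every requested label in order and fills unmatched ones with NaN.
result = (
    wanted
    .merge(df.reset_index(), on='key', how='left')
    .set_index('key')
)
print(result)
```

The original frame is left untouched, including its NaN labels; only the lookup goes through the merge.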

