
pandas dataframe drop rows by multiindex

I'd like to drop rows from a pandas DataFrame using the MultiIndex value.

I've tried quite a few things; below is the attempt that I think came closest. (I'll explain the full problem, since there may be an alternative solution that takes a completely different approach.) From a correlation matrix, I'd like to get the pairs of columns that correlate most strongly. I use unstack and put the results in a DataFrame:

In [263]: corr_df = pd.DataFrame(total.corr().unstack())

Then I keep the higher correlations (I should really keep the strong negative ones as well).

In [264]: high = corr_df[(corr_df[0] > 0.5) & (corr_df[0] < 1.0)]

In [236]: print high
                                                  0
residual sugar       density               0.552517
free sulfur dioxide  total sulfur dioxide  0.720934
total sulfur dioxide free sulfur dioxide   0.720934
                     wine                  0.700357
density              residual sugar        0.552517
wine                 total sulfur dioxide  0.700357

Close enough, but there are duplicates, which is inherent to a correlation matrix: it is symmetric, so every pair appears twice. In order to clean them up, my idea is to iterate over the high values and remove the duplicates:

In [267]:
for row in high.iterrows():
    print row[0][0], ",", row[0][1]
    print high.loc[row[0][1]].loc[row[0][0]].index
    high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
residual sugar , density
Int64Index([0], dtype='int64')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-267-1258da2a4772> in <module>()
      2     print row[0][0], ",", row[0][1]
      3     print high.loc[row[0][1]].loc[row[0][0]].index
----> 4     high.drop(high.loc[row[0][1]].loc[row[0][0]].index)

...
[huge stack of errors]
...
KeyError: 0

The drop method works perfectly when the index is a regular one (see drop), but how do I build the label when I have a MultiIndex?
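For reference, on a MultiIndex each row label is a tuple with one entry per level, and `drop` accepts a list of such tuples. The following is a minimal sketch rather than a definitive fix: the `high` frame is rebuilt by hand from the output printed above, and the names `seen`, `mirrored` and `deduped` are only illustrative.

```python
import pandas as pd

# Hand-built stand-in for `high`, copied from the printed output above.
index = pd.MultiIndex.from_tuples([
    ("residual sugar", "density"),
    ("free sulfur dioxide", "total sulfur dioxide"),
    ("total sulfur dioxide", "free sulfur dioxide"),
    ("total sulfur dioxide", "wine"),
    ("density", "residual sugar"),
    ("wine", "total sulfur dioxide"),
])
high = pd.DataFrame(
    {0: [0.552517, 0.720934, 0.720934, 0.700357, 0.552517, 0.700357]},
    index=index,
)

# Collect the mirrored (b, a) label of every pair already seen as (a, b).
seen = set()
mirrored = []
for a, b in high.index:
    if (b, a) in seen:
        mirrored.append((a, b))
    else:
        seen.add((a, b))

# Each label passed to drop is a full (level 0, level 1) tuple.
# drop returns a new frame; `high` itself is left unchanged.
deduped = high.drop(mirrored)
print(deduped)
```

Collecting the labels first and calling `drop` once also avoids changing the frame while iterating over it.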

import pandas as pd

corr_df = pd.DataFrame(
    {'residual sugar': [1, 0, 0, 0.552517, 0],
     'free sulfur dioxide': [0, 1, 0.720934, 0, 0],
     'total sulfur dioxide': [0, 0.720934, 1, 0, 0.700357],
     'density': [0.552517, 0, 0, 1, 0],
     'wine': [0, 0, 0.700357, 0, 1]},
    index=['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide',
           'density', 'wine']).unstack()

# Notice the slight modification to the original: unstack() already returns
# a Series, so the comparison is applied to it directly
high = corr_df[(corr_df > 0.5) & (corr_df < 1.0)]

# Sort by index, then by value. Both methods return new objects, and the
# stable mergesort keeps the two entries of each pair in index order.
high = high.sort_index().sort_values(kind='mergesort')

# Drop every other value (i.e. keep only the even positions)
result = high.iloc[::2]
>>> result
density               residual sugar          0.552517
total sulfur dioxide  wine                    0.700357
free sulfur dioxide   total sulfur dioxide    0.720934
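An alternative that does not rely on sort order is to mask the correlation matrix down to its upper triangle before flattening it, so each unordered pair appears only once. This is a sketch under assumptions: the matrix is the sample data from the answer above, `corr`, `upper`, `pairs` and `strong` are illustrative names, and `abs()` is used so that strong negative correlations are kept too, as the question asks.

```python
import numpy as np
import pandas as pd

labels = ["residual sugar", "free sulfur dioxide", "total sulfur dioxide",
          "density", "wine"]
# Sample correlation matrix, copied from the answer above.
corr = pd.DataFrame(
    [[1, 0, 0, 0.552517, 0],
     [0, 1, 0.720934, 0, 0],
     [0, 0.720934, 1, 0, 0.700357],
     [0.552517, 0, 0, 1, 0],
     [0, 0, 0.700357, 0, 1]],
    index=labels, columns=labels, dtype=float,
)

# True strictly above the diagonal: one cell per unordered pair, no self-pairs.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Cells outside the mask become NaN; NaN never passes the comparison below,
# so they are excluded whether or not stack() drops them.
pairs = corr.where(upper).stack()

# abs() keeps strongly negative correlations as well as positive ones.
strong = pairs[pairs.abs() > 0.5]
print(strong)
```

Because the diagonal is masked out, the `< 1.0` condition used to exclude self-correlations is no longer needed.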
