pandas dataframe drop rows by multiindex
I'd like to drop rows from a pandas DataFrame using the MultiIndex value.
I've tried quite a few things; below is the attempt that I think came closest. (I'll explain the full problem, since there might be an alternative solution using a completely different approach.) From a correlation matrix, I'd like to get the pairs of columns that correlate most strongly. I use unstack and put the results in a DataFrame:
In [263]: corr_df = pd.DataFrame(total.corr().unstack())
Then I select the high correlations (actually I should get the negative ones as well):
In [264]: high = corr_df[(corr_df[0] > 0.5) & (corr_df[0] < 1.0)]
In [236]: print high
                                                  0
residual sugar       density               0.552517
free sulfur dioxide  total sulfur dioxide  0.720934
total sulfur dioxide free sulfur dioxide   0.720934
                     wine                  0.700357
density              residual sugar        0.552517
wine                 total sulfur dioxide  0.700357
Close enough, but there are duplicates, which is inherent to a correlation matrix because it is symmetric. To clean them up, my idea is to iterate over the high values and remove the duplicates:
In [267]:
for row in high.iterrows():
    print row[0][0], ",", row[0][1]
    print high.loc[row[0][1]].loc[row[0][0]].index
    high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
residual sugar , density
Int64Index([0], dtype='int64')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-267-1258da2a4772> in <module>()
2 print row[0][0], ",", row[0][1]
3 print high.loc[row[0][1]].loc[row[0][0]].index
----> 4 high.drop(high.loc[row[0][1]].loc[row[0][0]].index)
...
[huge stack of errors]
...
KeyError: 0
The drop method works perfectly when the index is a regular one (see drop), but how do I build the label when I have a MultiIndex?
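On the narrow question of how to build the label, here is a minimal sketch rather than a definitive fix. It assumes `high` looks like the frame printed above (the toy values are copied from that output) and that each row label of a MultiIndex is a tuple with one entry per level. The attempt in the question passes `Int64Index([0])`, which is the column label of the selected row rather than a row label, hence `KeyError: 0`. The names `seen`, `mirrors` and `deduped` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `high`; values copied from the output in the question
index = pd.MultiIndex.from_tuples([
    ("residual sugar", "density"),
    ("free sulfur dioxide", "total sulfur dioxide"),
    ("total sulfur dioxide", "free sulfur dioxide"),
    ("total sulfur dioxide", "wine"),
    ("density", "residual sugar"),
    ("wine", "total sulfur dioxide"),
])
high = pd.DataFrame(
    {0: [0.552517, 0.720934, 0.720934, 0.700357, 0.552517, 0.700357]},
    index=index,
)

# Row labels of a MultiIndex are tuples, so drop() takes a list of tuples.
# drop() returns a new frame; `high` itself is left unchanged.
dropped_one = high.drop([("density", "residual sugar")])

# De-duplication: remember each pair the first time it is seen and
# collect the mirrored label (b, a) whenever (a, b) was seen earlier.
seen = set()
mirrors = []
for first, second in high.index:
    if (second, first) in seen:
        mirrors.append((first, second))
    else:
        seen.add((first, second))

# Drop all mirrored rows in one call, outside the loop
deduped = high.drop(mirrors)
print(deduped)
```

Collecting the labels first and dropping once avoids modifying the frame while iterating over it.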
import pandas as pd

corr_df = pd.DataFrame(
    {'residual sugar': [1, 0, 0, 0.552517, 0],
     'free sulfur dioxide': [0, 1, 0.720934, 0, 0],
     'total sulfur dioxide': [0, 0.720934, 1, 0, 0.700357],
     'density': [0.552517, 0, 0, 1, 0],
     'wine': [0, 0, 0.700357, 0, 1]},
    index=['residual sugar', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'wine']).unstack()
# Notice the slight modification to the original
high = corr_df[(corr_df > 0.5) & (corr_df < 1.0)]
# Sort by index, then by value with a stable sort, so the two copies of
# each pair land on adjacent rows (both methods return a new Series)
high = high.sort_index().sort_values(kind='mergesort')
# Drop every other value (e.g. just take the evens)
result = high.iloc[::2]
>>> result
density              residual sugar          0.552517
total sulfur dioxide wine                    0.700357
free sulfur dioxide  total sulfur dioxide    0.720934
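Taking every other row relies on each pair having a distinct correlation value, so that its two copies sit next to each other after sorting. As an alternative sketch (a different technique from the one above, not a correction of it), the duplicates can be avoided before they appear by keeping only the part of the matrix strictly above the diagonal. This assumes the correlation matrix is square with identical row and column labels. The names `corr`, `upper` and `pairs` are illustrative, and the toy matrix reuses the values above.

```python
import numpy as np
import pandas as pd

labels = ["residual sugar", "free sulfur dioxide", "total sulfur dioxide",
          "density", "wine"]
corr = pd.DataFrame(
    [[1, 0, 0, 0.552517, 0],
     [0, 1, 0.720934, 0, 0],
     [0, 0.720934, 1, 0, 0.700357],
     [0.552517, 0, 0, 1, 0],
     [0, 0, 0.700357, 0, 1]],
    index=labels, columns=labels, dtype=float,
)

# Boolean mask that is True strictly above the diagonal: every unordered
# pair appears exactly once and the self-correlations are excluded
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# where() turns the masked-out cells into NaN; stack() reshapes to a
# Series with a two-level index, and dropna() removes any NaN cells left
pairs = corr.where(upper).stack().dropna()

# abs() keeps strong negative correlations as well as positive ones
high = pairs[pairs.abs() > 0.5]
print(high)
```

Because the diagonal is already masked out, the `< 1.0` condition is not needed to exclude self-correlations. It would still be needed if perfectly correlated distinct columns should be filtered out.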