How to properly use a Pandas DataFrame with a MultiIndex that contains intervals?

Question

I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:

import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

It looks like this:

            E  var1
ntv     id
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5

What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. For example:

df.loc[4]

should (trivially) return

    E  var1
id
1   1   0.1
2   0   0.5

The problem is that I keep getting a TypeError about the index, even though the docs show a similar operation (on a single-level index) that produces exactly what I'm looking for.

TypeError: only integer scalar arrays can be converted to a scalar index

I've tried many things, and nothing seems to work as expected. I could keep the id column inside the DataFrame instead of the index, but I'd rather keep my index unique, and I would constantly be calling set_index('id').

I feel like either a) I'm missing something about MultiIndexes, or b) there is a bug or ambiguity in using an IntervalIndex inside a MultiIndex.

Answer 1

Since we are dealing with intervals, there is a method called get_loc that finds the rows whose interval contains a given value. To show what I mean:

import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv     id         
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5

df.iloc[(df.index.get_level_values(0).get_loc(11))]
             E  var1
ntv     id         
(0, 12] 2   0   0.5

This also works if you have multiple rows of data for one interval, e.g.:

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1}, 
   {'id': 3, 'var1': 0.1, 'ntv': ntv(0,10), 'E': 1},
   {'id':2, 'var1': 0.5, 'ntv': ntv(0,12), 'E': 0}
], index=('ntv', 'id'))

df.iloc[(df.index.get_level_values(0).get_loc(4))]

            E  var1
ntv     id         
(0, 10] 1   1   0.1
        3   1   0.1
(0, 12] 2   0   0.5

Timed against a list comprehension, this approach is much faster on large DataFrames (roughly six times faster in the run below):

ndf = pd.concat([df]*10000)

%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop

%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
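
Note that the return type of get_loc is not fixed: depending on the pandas version and on whether the intervals overlap, it can be an integer, a slice, or a boolean mask, so the iloc call above may yield a Series rather than a DataFrame. A minimal sketch that always yields a DataFrame, assuming a pandas version that provides IntervalIndex.contains (0.25 or later, as far as I know). The helper name rows_containing is made up for this illustration:

```python
import pandas as pd

records = [
    {"id": 1, "var1": 0.1, "ntv": pd.Interval(0, 10), "E": 1},
    {"id": 2, "var1": 0.5, "ntv": pd.Interval(0, 12), "E": 0},
]
df = pd.DataFrame(records).set_index(["ntv", "id"])


def rows_containing(frame, value, level=0):
    """Return the rows whose interval at `level` contains `value` (hypothetical helper)."""
    # Rebuild an IntervalIndex from the level values so that .contains is available
    intervals = pd.IntervalIndex(frame.index.get_level_values(level))
    mask = intervals.contains(value)  # elementwise boolean array
    return frame[mask]


both = rows_containing(df, 4)          # both intervals contain 4
only_second = rows_containing(df, 11)  # only (0, 12] contains 11
```

Because the selection goes through a boolean mask, the result keeps the full MultiIndex and is empty (rather than an error) when no interval matches.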

Answer 2

So I did a bit of digging to try to understand the problem. When I run your code, pandas ends up trying to index into the index labels with "slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)", that is, a slice whose bounds are arrays rather than scalars.

(When I say index_type, I mean the pandas datatype.)

An index_type's labels are a list of integer positions that map into the index_type's levels array. Here is an example from the documentation:

    >>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
    >>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
    MultiIndex(levels=[[1, 2], ['blue', 'red']],
               labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
               names=['number', 'color'])

Notice how the second list in labels maps onto the order of levels: levels[1][1] is 'red' and levels[1][0] is 'blue'.
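
In pandas 0.24 and later, the attribute printed as labels in that output is called codes. A small sketch of how the codes point into the levels, reusing the arrays from the documentation example (the variable names are only for illustration):

```python
import pandas as pd

arrays = [[1, 1, 2, 2], ["red", "blue", "red", "blue"]]
mi = pd.MultiIndex.from_arrays(arrays, names=("number", "color"))

color_levels = list(mi.levels[1])  # the distinct values of the 'color' level, sorted
color_codes = list(mi.codes[1])    # one integer position per row, pointing into the levels

# Decoding the codes through the levels recovers the original column of values
decoded = [color_levels[code] for code in color_codes]
```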

Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. Take a look at the original proposal for it, https://github.com/pandas-dev/pandas/issues/7640:

"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."

My suggestion is to move the interval into a column. You could probably write a simple function with numba to test whether a number falls in each interval. Do you mind explaining how you're benefiting from the interval index?
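
A rough sketch of the column-based idea, using plain vectorised pandas comparisons instead of numba. The column names left and right, the right-closed convention, and the function name rows_containing are all assumptions made for this illustration:

```python
import pandas as pd

# Interval bounds stored as ordinary columns; 'id' remains a unique index
df = pd.DataFrame(
    {
        "id": [1, 2],
        "left": [0, 0],
        "right": [10, 12],
        "var1": [0.1, 0.5],
        "E": [1, 0],
    }
).set_index("id")


def rows_containing(frame, value):
    """Rows where left < value <= right, i.e. right-closed intervals like (0, 10]."""
    mask = (frame["left"] < value) & (value <= frame["right"])
    return frame[mask]


hits_4 = rows_containing(df, 4)
hits_11 = rows_containing(df, 11)
```

This keeps the index unique on id, at the cost of losing the Interval objects themselves.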

Answer 3

Piggybacking off of @Dark's solution: Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method directly when you don't need the additional parameters and red tape.

idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]

My originally proposed solution:

intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]

Regardless, it's certainly strange that these two return different results. It does look like it has to do with whether the index is unique, monotonic, or neither:

df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4]  # TypeError
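
When intervals overlap, a single value can match several rows, which is the case get_indexer_non_unique is intended for. A hedged sketch, assuming a reasonably recent pandas; the variable names are illustrative, and the filtering of -1 is there because that value marks "no match" and would otherwise be read by iloc as "last row":

```python
import pandas as pd

records = [
    {"id": 1, "var1": 0.1, "ntv": pd.Interval(0, 10), "E": 1},
    {"id": 2, "var1": 0.5, "ntv": pd.Interval(0, 12), "E": 0},
]
df = pd.DataFrame(records).set_index(["ntv", "id"])

# Make sure we are working with an IntervalIndex for the first level
idx = pd.IntervalIndex(df.index.get_level_values(0))

# indexer: positions of every interval matching each target (-1 when none match)
# missing: positions in the target list that matched nothing
indexer, missing = idx.get_indexer_non_unique([4])

# Drop the -1 sentinel before handing the positions to iloc
matches = df.iloc[indexer[indexer >= 0]]
```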

Answer 4

This is not really a solution, and I don't fully understand it, but I think it may have to do with your interval index not being monotonic (in that you have overlapping intervals). I guess overlapping intervals could in a sense still be considered monotonic, so perhaps you could instead say that the overlap means the index is not unique?

Anyway, check out this GitHub issue:

ENH: Implement MultiIndex.is_monotonic_decreasing #17455

And here's an example with your data, but with the intervals changed to be non-overlapping, (0, 6) and (7, 12):

df = pd.DataFrame.from_records([
   {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 6), 'E': 1}, 
   {'id': 2, 'var1': 0.5, 'ntv': ntv(7,12), 'E': 0}
], index=('ntv', 'id'))

Now, loc works OK:

df.loc[4]

    E  var1
id         
1   1   0.1
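
Whether a set of intervals overlaps can be checked directly, assuming a pandas version that has the IntervalIndex.is_overlapping property (0.24 or later, to my knowledge). A small sketch comparing the intervals from the question with the non-overlapping ones used here:

```python
import pandas as pd

# Intervals from the question: (0, 10] and (0, 12] share points
overlapping = pd.IntervalIndex.from_tuples([(0, 10), (0, 12)])

# Intervals from this answer: (0, 6] and (7, 12] share no points
disjoint = pd.IntervalIndex.from_tuples([(0, 6), (7, 12)])

question_overlaps = overlapping.is_overlapping
answer_overlaps = disjoint.is_overlapping

# Both indexes hold distinct intervals, so overlap is a separate property from uniqueness
both_unique = overlapping.is_unique and disjoint.is_unique
```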

Answer 5

def check_value(num):
    return df[[num in i for i in map(lambda x: x[0], df.index)]] 

a = check_value(4)
a
>> 
            E  var1
ntv     id         
(0, 10] 1   1   0.1
(0, 12] 2   0   0.5

If you want to drop the interval index level, you can add:

a.index = a.index.droplevel(0)

5 answers

Answer 1: score 6, accepted, 2017-12-07 06:19:59
Answer 2: score 3, 2017-12-07 05:36:53
Answer 3: score 2, 2017-12-03 20:34:24
Answer 4: score 2, 2017-12-07 03:39:12
Answer 5: score 0, 2017-12-08 19:42:56
