I'm trying to slice into a DataFrame that has a MultiIndex composed of an IntervalIndex and a regular Index. Example code:
import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
Looks like this:
            E  var1
ntv      id
(0, 10]  1  1   0.1
(0, 12]  2  0   0.5
What I would like to do is slice into the DataFrame at a specific value and return all rows whose interval contains that value. For example:
df.loc[4]
should return (trivially)
    E  var1
id
1   1   0.1
2   0   0.5
The problem is that I keep getting a TypeError about the index:
TypeError: only integer scalar arrays can be converted to a scalar index
The docs show a similar operation (but on a single-level index) that does produce what I'm looking for.
I've tried many things, but nothing seems to work. I could move the id column into the DataFrame, but I'd rather keep my index unique, and I would constantly be calling set_index('id').
I feel like either a) I'm missing something about MultiIndexes or b) there is a bug / ambiguity with using an IntervalIndex in a MultiIndex.
Since we are talking about intervals, there is a method called get_loc that finds the rows whose interval contains the value. To show what I mean:
import pandas as pd
from pandas import Interval as ntv

df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv      id
(0, 10]  1  1   0.1
(0, 12]  2  0   0.5
df.iloc[(df.index.get_level_values(0).get_loc(11))]
            E  var1
ntv      id
(0, 12]  2  0   0.5
This also works if you have multiple rows of data for one interval, i.e.:
df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 3, 'var1': 0.1, 'ntv': ntv(0, 10), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(0, 12), 'E': 0}
], index=('ntv', 'id'))
df.iloc[(df.index.get_level_values(0).get_loc(4))]
            E  var1
ntv      id
(0, 10]  1  1   0.1
         3  1   0.1
(0, 12]  2  0   0.5
If you time this against a list comprehension, this approach is much faster for large DataFrames, e.g.:
ndf = pd.concat([df]*10000)
%%timeit
ndf.iloc[ndf.index.get_level_values(0).get_loc(4)]
10 loops, best of 3: 32.8 ms per loop
%%timeit
intervals = ndf.index.get_level_values(0)
mask = [4 in i for i in intervals]
ndf.loc[mask]
1 loop, best of 3: 193 ms per loop
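For comparison, the same boolean mask can also be built in a vectorised way rather than with a Python-level loop. The following is only a sketch, not part of the original answer: it assumes a recent pandas version in which IntervalIndex.contains is available, it builds the MultiIndex explicitly instead of via from_records, and the helper name rows_containing is hypothetical.

```python
import pandas as pd

# Same data as above, built explicitly so the first level is an IntervalIndex.
index = pd.MultiIndex.from_arrays(
    [pd.IntervalIndex.from_tuples([(0, 10), (0, 10), (0, 12)]), [1, 3, 2]],
    names=["ntv", "id"],
)
df = pd.DataFrame({"E": [1, 1, 0], "var1": [0.1, 0.1, 0.5]}, index=index)


def rows_containing(frame, value):
    """Hypothetical helper: keep rows whose 'ntv' interval contains `value`."""
    intervals = pd.IntervalIndex(frame.index.get_level_values("ntv"))
    mask = intervals.contains(value)  # boolean array, one entry per row
    return frame[mask]


hits_4 = rows_containing(df, 4)    # every interval contains 4
hits_11 = rows_containing(df, 11)  # only (0, 12] contains 11
print(hits_4)
print(hits_11)
```

Unlike get_loc, a mask always yields a DataFrame, including an empty one when no interval contains the value.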
So I did a bit of digging to try to understand the problem. If I run your code, the following happens: pandas tries to index into the index labels with "slice(array([0, 1], dtype=int64), array([1, 2], dtype=int64), None)".
(When I say index_type, I mean the pandas datatype.)
An index_type's labels are a list of integer positions that map into the index_type's levels array. Here is an example from the documentation.
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex(levels=[[1, 2], ['blue', 'red']],
           labels=[[0, 0, 1, 1], [1, 0, 1, 0]],
           names=['number', 'color'])
Notice how the second list in labels refers to positions in the second list of levels: levels[1][1] is equal to 'red', and levels[1][0] is equal to 'blue'.
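The mapping can be checked by decoding the positions by hand. This is a small sketch under one assumption: newer pandas versions expose the positions as MultiIndex.codes (the attribute printed as labels in the older output above), so codes is used here. The variable names are illustrative.

```python
import pandas as pd

arrays = [[1, 1, 2, 2], ["red", "blue", "red", "blue"]]
mi = pd.MultiIndex.from_arrays(arrays, names=("number", "color"))

# Unique values of the second level, stored in sorted order.
color_levels = list(mi.levels[1])
# Integer positions into color_levels, one per row.
color_codes = list(mi.codes[1])

# Looking each position up in the levels recovers the original column.
decoded = [color_levels[code] for code in color_codes]
print(color_levels, color_codes, decoded)
```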
Anyhow, this is all to say that I don't believe IntervalIndex is meant to be used in an overlapping fashion. If you look at the original proposal for it, https://github.com/pandas-dev/pandas/issues/7640, it says:
"A IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals."
My suggestion is to move the interval into a column. You could probably write up a simple function with numba to test if a number is in each interval. Do you mind explaining the way you're benefiting from the interval?
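A minimal sketch of that suggestion, using plain pandas comparisons rather than numba. The column names left and right and the helper rows_containing are hypothetical, and the intervals are assumed to be closed on the right, as in the question.

```python
import pandas as pd

# Store the interval bounds as ordinary columns and keep 'id' as a unique index.
df = pd.DataFrame(
    {
        "id": [1, 2],
        "left": [0, 0],
        "right": [10, 12],
        "var1": [0.1, 0.5],
        "E": [1, 0],
    }
).set_index("id")


def rows_containing(frame, value):
    """Hypothetical helper: rows whose (left, right] interval contains value."""
    mask = (frame["left"] < value) & (value <= frame["right"])
    return frame[mask]


print(rows_containing(df, 4))   # both rows
print(rows_containing(df, 11))  # only id 2
```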
Piggybacking off of @Dark's solution, Index.get_loc just calls Index.get_indexer under the hood, so it might be more efficient to call the underlying method directly when you don't need the additional parameters and red tape.
idx = df.index.get_level_values(0)
df.iloc[idx.get_indexer([4])]
My originally proposed solution:
intervals = df.index.get_level_values(0)
mask = [4 in i for i in intervals]
df.loc[mask]
Regardless, it is certainly strange that these two return different results, but it does look like it has to do with the index being unique, monotonic, or neither of the two:
df.reset_index(level=1, drop=True).loc[4] # good
df.loc[4] # TypeError
This is not really a solution, and I don't fully understand it, but I think it may have to do with your interval index not being monotonic (in that you have overlapping intervals). I guess it could in a sense still be considered monotonic, so perhaps alternatively you could say the overlap means the index is not unique?
Anyway, check out this github issue:
ENH: Implement MultiIndex.is_monotonic_decreasing #17455
And here's an example with your data, but changing the intervals to be non-overlapping (0,6) & (7,12):
df = pd.DataFrame.from_records([
    {'id': 1, 'var1': 0.1, 'ntv': ntv(0, 6), 'E': 1},
    {'id': 2, 'var1': 0.5, 'ntv': ntv(7, 12), 'E': 0}
], index=('ntv', 'id'))
Now, loc works OK:
df.loc[4]
    E  var1
id
1   1   0.1
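The index properties discussed here can be inspected directly. The sketch below assumes a recent pandas version in which IntervalIndex.is_overlapping is available; the variable names are illustrative, and it only reports the properties without claiming which one triggers the TypeError.

```python
import pandas as pd

# Intervals from the question, and the non-overlapping variant used above.
overlapping = pd.IntervalIndex.from_tuples([(0, 10), (0, 12)])
disjoint = pd.IntervalIndex.from_tuples([(0, 6), (7, 12)])

for name, idx in [("overlapping", overlapping), ("disjoint", disjoint)]:
    print(
        name,
        "is_overlapping:", idx.is_overlapping,
        "is_unique:", idx.is_unique,
        "is_monotonic_increasing:", idx.is_monotonic_increasing,
    )
```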
def check_value(num):
    # Each MultiIndex entry is an (interval, id) tuple; keep rows whose interval contains num.
    return df[[num in interval for interval, _ in df.index]]

a = check_value(4)
a
            E  var1
ntv      id
(0, 10]  1  1   0.1
(0, 12]  2  0   0.5
If you want to drop the interval level from the index, you can add:
a.index = a.index.droplevel(0)