
Subset pandas DataFrame based on a bin

I am trying to subset a pandas DataFrame based on a binned category. (I know you can subset on the values themselves; this is a simplified version of a different problem where I really do need to bin the data.) I think I'm missing something about how the subsetting works, but I can't find out what in the documentation. Here is an example:

import numpy as np
import pandas as pd

np.random.seed(9876)

# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)

# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(), 
                           stop = random_data.max() + random_data.max()*0.1, 
                           step = bin_step)

# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(), 
              bin_ranges,
              right = True,
              include_lowest = True)

# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')

df = pd.concat([bins_transformed, random_data_pd], axis = 1)

When I try to subset on a bin, for example (5.086, 5.586], the comparison returns all False. Why does this not subset?

df.bins == '(5.086, 5.586]' #returns all false.

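If I understand correctly, the comparison fails because == is being applied to two different types: each element of df.bins is a pd.Interval, while '(5.086, 5.586]' is a str. An Interval never compares equal to a string, so compare against a pd.Interval instead. Please check my example.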

print(type(df.bins[0]))

<class 'pandas._libs.interval.Interval'>
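The same mismatch can be reproduced on a much smaller case. The sketch below uses a tiny, made-up Series (the name `s` and the bin edges are assumptions for illustration, not the data from the question) and compares it once with a string and once with a pd.Interval:

```python
import pandas as pd

# Hypothetical three-row example, binned into (0, 1], (1, 2], (2, 3].
s = pd.Series(pd.cut([0.5, 1.5, 2.5], bins=[0, 1, 2, 3]), name="bins")

# Comparing Interval categories with a str: no element can match.
as_string = s == "(1, 2]"

# Comparing with an Interval that has the same endpoints
# (pd.Interval is closed on the right by default, like pd.cut's bins).
as_interval = s == pd.Interval(1, 2)

print(as_string.tolist())
print(as_interval.tolist())
```

The string comparison should come back all False, while the Interval comparison should be True only for the middle row.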

print(df.bins)
# The Interval must have the same endpoints as the bin to compare equal.
print(df.bins == pd.Interval(5.086, 5.586))

0     (1.586, 2.086]
1     (6.086, 6.586]
2     (8.586, 9.086]
3     (7.586, 8.086]
4     (5.086, 5.586]
5     (0.585, 1.086]
6     (4.586, 5.086]
7     (1.086, 1.586]
8     (9.086, 9.586]
9     (4.586, 5.086]
10    (1.586, 2.086]
11    (1.086, 1.586]
12    (2.586, 3.086]
13    (2.586, 3.086]
14    (1.086, 1.586]
15    (8.086, 8.586]
16    (7.086, 7.586]
17    (6.586, 7.086]
18    (8.586, 9.086]
19    (7.586, 8.086]
20    (7.586, 8.086]
21    (0.585, 1.086]
22    (4.586, 5.086]
23    (9.086, 9.586]
24    (8.086, 8.586]
25    (6.586, 7.086]
26    (5.086, 5.586]
27    (6.586, 7.086]
28    (5.086, 5.586]
29    (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
                                     (2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
                                     (9.086, 9.586] < (9.586, 10.086]]
0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17    False
18    False
19    False
20    False
21    False
22    False
23    False
24    False
25    False
26     True
27    False
28     True
29    False
Name: bins, dtype: bool

To subset the DataFrame, index it with that boolean mask:

print(df[df.bins == pd.Interval(5.086, 5.586)])

              bins    values
4   (5.086, 5.586]  5.132422
26  (5.086, 5.586]  5.309666
28  (5.086, 5.586]  5.574920
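If typing the exact endpoints is inconvenient, two other approaches may help. This is only a sketch on made-up data (`df_demo`, `target`, and the bin edges are assumptions, not part of the question): one compares the string form of each bin with its printed label, the other looks up the bin that contains a given value and compares against that Interval.

```python
import pandas as pd

# Hypothetical data, binned into (0, 1], (1, 2], (2, 3].
values = [0.5, 1.5, 1.7, 2.5]
df_demo = pd.DataFrame({"bins": pd.cut(values, bins=[0, 1, 2, 3]),
                        "values": values})

# Option 1: convert the bins to strings and compare with the printed label.
by_label = df_demo[df_demo["bins"].astype(str) == "(1, 2]"]

# Option 2: find the category (bin) that contains a value of interest,
# then compare the column against that Interval.
target = 1.6
categories = df_demo["bins"].cat.categories      # an IntervalIndex
wanted = categories[categories.get_loc(target)]  # the bin containing target
by_value = df_demo[df_demo["bins"] == wanted]

print(by_label)
print(by_value)
```

Option 1 depends on the label being typed exactly as pandas prints it, so Option 2 is probably the safer choice when the bin edges are floats.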
