I am trying to subset a pandas DataFrame based on a binned category. (I know you can subset on the values themselves; this is just a simplified version of a different problem where I really do need to bin the data.) I think I'm missing something about the subsetting, but I can't find what in the documentation. Here is an example:
import numpy as np
import pandas as pd
np.random.seed(9876)
# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)
# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(),
                       stop = random_data.max() + random_data.max()*0.1,
                       step = bin_step)
# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(),
              bin_ranges,
              right = True,
              include_lowest = True)
# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')
df = pd.concat([bins_transformed, random_data_pd], axis = 1)
When subsetting the bins, for example (5.086, 5.586], it's returning all False. Why does this not subset?
df.bins == '(5.086, 5.586]' #returns all false.
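If I understand correctly, the reason is that you are using == to compare two different types: pd.Interval vs. str. An Interval never equals a string, so every comparison comes back False. Compare against a pd.Interval instead; please check my example.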
print(type(df.bins[0]))
<class 'pandas._libs.interval.Interval'>
print(df.bins)
print(df.bins == pd.Interval(5.086, 5.586))
0 (1.586, 2.086]
1 (6.086, 6.586]
2 (8.586, 9.086]
3 (7.586, 8.086]
4 (5.086, 5.586]
5 (0.585, 1.086]
6 (4.586, 5.086]
7 (1.086, 1.586]
8 (9.086, 9.586]
9 (4.586, 5.086]
10 (1.586, 2.086]
11 (1.086, 1.586]
12 (2.586, 3.086]
13 (2.586, 3.086]
14 (1.086, 1.586]
15 (8.086, 8.586]
16 (7.086, 7.586]
17 (6.586, 7.086]
18 (8.586, 9.086]
19 (7.586, 8.086]
20 (7.586, 8.086]
21 (0.585, 1.086]
22 (4.586, 5.086]
23 (9.086, 9.586]
24 (8.086, 8.586]
25 (6.586, 7.086]
26 (5.086, 5.586]
27 (6.586, 7.086]
28 (5.086, 5.586]
29 (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
(2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
(9.086, 9.586] < (9.586, 10.086]]
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 False
Name: bins, dtype: bool
To subset, use that boolean mask to index the DataFrame:
print(df[df.bins == pd.Interval(5.086, 5.586)])
              bins    values
4   (5.086, 5.586]  5.132422
26  (5.086, 5.586]  5.309666
28  (5.086, 5.586]  5.574920
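For reference, here is a minimal, self-contained sketch of three ways to build the same mask. It uses small made-up data and round bin edges (the names `edges`, `target`, and the `mask_*` variables are only for illustration, not from the question), so the interval endpoints are exact. With the data in the question, the endpoints would have to match the bin edges as stored in the categories.

```python
import pandas as pd

# Illustrative data with round bin edges (not the data from the question).
edges = [0.0, 0.5, 1.0, 1.5, 2.0]
df = pd.DataFrame({"values": [0.2, 0.7, 1.4, 1.9, 0.9]})
df["bins"] = pd.cut(df["values"], edges, right=True)

# 1) Compare against an Interval whose endpoints and closed side match the bin.
target = pd.Interval(0.5, 1.0, closed="right")
mask_interval = df["bins"] == target

# 2) Convert the bins to strings first, then compare string to string.
mask_string = df["bins"].astype(str) == str(target)

# 3) Look up which category contains a given value, then compare category codes.
loc = df["bins"].cat.categories.get_loc(0.7)
mask_lookup = df["bins"].cat.codes == loc

print(df[mask_interval])
```

The first form is the direct fix for the comparison in the question. The string form is handy if the bin label arrives as text, and the lookup form avoids typing the bin edges at all.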