Subset pandas DataFrame based on a bin
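I am trying to subset a pandas DataFrame based on a categorical bin. (I know you can subset on the values themselves; that is a different question. I actually need the bin, and the data here is only an illustration.) I feel I am missing something about subsetting, but I cannot find what in the documentation. Here is an example: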
import numpy as np
import pandas as pd
np.random.seed(9876)
# Generating random data for binning.
bin_step = 0.5
random_data = np.random.uniform(low = 0, high = 10, size = 30)
# Generating bin ranges
bin_ranges = np.arange(start = random_data.min(),
                       stop = random_data.max() + random_data.max()*0.1,
                       step = bin_step)
# Cutting the random data into predefined bins.
bins = pd.cut(random_data.tolist(),
              bin_ranges,
              right = True,
              include_lowest = True)
# Aggregating into a pandas DataFrame
random_data_pd = pd.Series(random_data.tolist(), name = 'values')
bins_transformed = pd.Series(bins, name = 'bins')
df = pd.concat([bins_transformed, random_data_pd], axis = 1)
When I subset on a bin, for example (5.086, 5.586], the comparison returns all False. Why doesn't this subset?
df.bins == '(5.086, 5.586]' #returns all false.
If I understand correctly, the reason is that you are applying == to two different types: pd.Interval vs. str. Please check my example.
print(type(df.bins[0]))
<class 'pandas._libs.interval.Interval'>
print(df.bins)
print(df.bins == pd.Interval(5.1, 5.2))
0 (1.586, 2.086]
1 (6.086, 6.586]
2 (8.586, 9.086]
3 (7.586, 8.086]
4 (5.086, 5.586]
5 (0.585, 1.086]
6 (4.586, 5.086]
7 (1.086, 1.586]
8 (9.086, 9.586]
9 (4.586, 5.086]
10 (1.586, 2.086]
11 (1.086, 1.586]
12 (2.586, 3.086]
13 (2.586, 3.086]
14 (1.086, 1.586]
15 (8.086, 8.586]
16 (7.086, 7.586]
17 (6.586, 7.086]
18 (8.586, 9.086]
19 (7.586, 8.086]
20 (7.586, 8.086]
21 (0.585, 1.086]
22 (4.586, 5.086]
23 (9.086, 9.586]
24 (8.086, 8.586]
25 (6.586, 7.086]
26 (5.086, 5.586]
27 (6.586, 7.086]
28 (5.086, 5.586]
29 (9.086, 9.586]
Name: bins, dtype: category
Categories (19, interval[float64]): [(0.585, 1.086] < (1.086, 1.586] < (1.586, 2.086] <
(2.086, 2.586] ... (8.086, 8.586] < (8.586, 9.086] <
(9.086, 9.586] < (9.586, 10.086]]
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 True
27 False
28 True
29 False
Name: bins, dtype: bool
Subsetting with that mask:
print(df[df.bins == pd.Interval(5.1, 5.2)])
bins values
4 (5.086, 5.586] 5.132422
26 (5.086, 5.586] 5.309666
28 (5.086, 5.586] 5.574920
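The output above matches a bin by comparing against an interval that lies inside it, which appears to rely on the pandas version used at the time; as far as I know, more recent releases only treat an exactly equal Interval as a match. A minimal sketch of an approach that does not depend on that behavior is to look up the exact Interval in the column's categories and compare against it. The data, the edges, and the probe value 1.5 below are hypothetical and chosen only to keep the example small.

```python
import pandas as pd

# Small hand-written data set (hypothetical, not the random data from the question).
df = pd.DataFrame({"values": [0.4, 1.2, 1.7, 2.5, 1.9]})
edges = [0.0, 1.0, 2.0, 3.0]
df["bins"] = pd.cut(df["values"], edges, right=True, include_lowest=True)

# The categories of the binned column form an IntervalIndex.
categories = df["bins"].cat.categories

# Find the position of the bin that contains 1.5, then fetch that exact Interval.
target = categories[categories.get_loc(1.5)]

# Comparing against the exact Interval object gives a usable boolean mask.
subset = df[df["bins"] == target]
print(target)
print(subset)
```

Here `target` is an Interval object rather than a string, so the == comparison is made between values of the same type, which is the point of the explanation above.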