Creating categorical data from 2 columns - Python Pandas

I am having trouble creating a DataFrame that holds the time interval in which a temperature measurement falls. At the moment the DataFrame is indexed by time and has a single column of measurements. I would like to convert the time into 12-hour intervals and have the measurement be the mean of the values within each interval.

                         measurement
time
2016-11-04 08:49:25    17.730000
2016-11-04 10:23:52    18.059999
2016-11-04 11:02:09    18.370001
2016-11-04 12:04:20    18.090000
2016-11-04 14:26:43    18.320000

So instead of having each time paired with its own measurement, I want the mean of the values over, say, 12 hours, like this:

                                              measurement
time
2016-11-04 00:00:00 - 2016-11-04 12:00:00     17.730000
2016-11-04 12:00:00 - 2016-11-05 00:00:00     18.059999
2016-11-05 00:00:00 - 2016-11-05 12:00:00     18.370001
2016-11-05 12:00:00 - 2016-11-06 00:00:00     18.090000
2016-11-06 00:00:00 - 2016-11-06 12:00:00     18.320000

Is there an easy way to do this with pandas?

Later I would like to convert the measurements into intervals as well, so that the data becomes boolean, like this:

                                              17.0-18.0   18.0-19.0  19.0-20
time
2016-11-04 00:00:00 - 2016-11-04 12:00:00         1           0         0
2016-11-04 12:00:00 - 2016-11-05 00:00:00         0           1         0
2016-11-05 00:00:00 - 2016-11-05 12:00:00         0           1         0
2016-11-05 12:00:00 - 2016-11-06 00:00:00         0           1         0
2016-11-06 00:00:00 - 2016-11-06 12:00:00         0           1         0

EDIT: I used a solution first posted by Coldspeed:

import pandas as pd
# `time` and `readings` hold the existing timestamps and temperature readings (not shown here)
df = pd.DataFrame({'timestamp': time.values, 'readings': readings.values})
# mean reading per 12-hour window
df = df.groupby(pd.Grouper(key='timestamp', freq='12H'))['readings'].mean()
v = pd.cut(df, bins=[17,18,19,20,21,22,23,24,25,26,27,28], labels=['17-18','18-19','19-20','20-21','21-22','22-23','23-24','24-25','25-26','26-27','27-28'])

I know the bins and labels could have been generated with a for loop, but this is just a quick fix. The groupby call groups the 'timestamp' values into 12-hour windows and takes the mean of the readings in each window.

Then the cut function is used to sort the means into their categories.

Result:

                     17-18  18-19  19-20  20-21  21-22  22-23  23-24  24-25  \
timestamp
2016-11-04 00:00:00      0      1      0      0      0      0      0      0
2016-11-04 12:00:00      0      1      0      0      0      0      0      0
2016-11-05 00:00:00      0      0      0      0      0      0      0      0
2016-11-05 12:00:00      1      0      0      0      0      0      0      0
2016-11-06 00:00:00      1      0      0      0      0      0      0      0
2016-11-06 12:00:00      0      0      0      0      0      0      0      0
2016-11-07 00:00:00      0      1      0      0      0      0      0      0
2016-11-07 12:00:00      1      0      0      0      0      0      0      0
2016-11-08 00:00:00      0      0      0      0      0      0      0      0
2016-11-08 12:00:00      0      0      0      0      0      0      0      0
2016-11-09 00:00:00      1      0      0      0      0      0      0      0
2016-11-09 12:00:00      1      0      0      0      0      0      0      0
2016-11-10 00:00:00      0      1      0      0      0      0      0      0
2016-11-10 12:00:00      0      0      0      0      0      0      0      0
2016-11-11 00:00:00      0      0      0      0      0      0      0      0
2016-11-11 12:00:00      0      0      0      0      0      0      0      0
2016-11-12 00:00:00      0      0      0      0      0      0      0      0
2016-11-12 12:00:00      0      0      0      0      0      0      0      0
2016-11-13 00:00:00      0      0      0      0      0      0      0      0
2016-11-13 12:00:00      0      0      0      0      0      0      0      0
2016-11-14 00:00:00      0      0      0      0      0      0      0      0
2016-11-14 12:00:00      0      1      0      0      0      0      0      0
2016-11-15 00:00:00      0      0      0      1      0      0      0      0
2016-11-15 12:00:00      0      0      0      0      0      1      0      0
2016-11-16 00:00:00      0      0      0      0      0      0      1      0
2016-11-16 12:00:00      0      0      0      0      0      0      0      0
2016-11-17 00:00:00      0      0      0      0      0      0      0      0
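
The edit above mentions that the bins and labels could be generated instead of typed out. A minimal sketch of that idea follows; the sample readings, the variable names and the final pd.get_dummies call are assumptions made for illustration (the table above looks like the one-hot form of v, but the edit does not show that step).

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's readings.
idx = pd.date_range("2016-11-04 08:00", periods=8, freq="6h")
readings = pd.Series([17.7, 18.1, 18.4, 19.2, 20.5, 21.1, 17.2, 17.9], index=idx)

# Integer bin edges 17..28 and matching "lo-hi" labels, built programmatically.
edges = list(range(17, 29))
labels = [f"{lo}-{hi}" for lo, hi in zip(edges[:-1], edges[1:])]

# Mean reading per 12-hour window, then one category per window.
means = readings.groupby(pd.Grouper(freq="12h")).mean()
binned = pd.cut(means, bins=edges, labels=labels)

# One indicator column per label, including labels that never occur.
dummies = pd.get_dummies(binned)
print(dummies)
```

Because the labels are derived from the same list as the edges, changing the range in one place keeps the two in sync.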

Use pd.cut + pd.get_dummies:

v = pd.cut(df.measurement, bins=[17, 18, 19, 20], labels=['17-18', '18-19', '19-20'])
pd.get_dummies(v)

   17-18  18-19  19-20
0      1      0      0
1      0      1      0
2      0      1      0
3      0      1      0
4      0      1      0
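
To tie this back to the 12-hour windows in the question, here is a rough end-to-end sketch using the five readings the asker posted. The resample step and the "start - end" index labels are assumptions about the desired layout, not part of the answer above.

```python
import pandas as pd

# The five readings from the question, indexed by time.
df = pd.DataFrame(
    {"measurement": [17.73, 18.059999, 18.370001, 18.09, 18.32]},
    index=pd.to_datetime([
        "2016-11-04 08:49:25", "2016-11-04 10:23:52", "2016-11-04 11:02:09",
        "2016-11-04 12:04:20", "2016-11-04 14:26:43",
    ]),
)

# Mean per 12-hour window, then bin the means and one-hot encode them.
means = df["measurement"].resample("12h").mean()
v = pd.cut(means, bins=[17, 18, 19, 20], labels=["17-18", "18-19", "19-20"])
dummies = pd.get_dummies(v, dtype=int)

# Optional: relabel each row as "window start - window end".
window = pd.Timedelta(hours=12)
dummies.index = [f"{start} - {start + window}" for start in means.index]
print(dummies)
```

With this sample both windows average between 18 and 19, so both rows land in the 18-19 column.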

IIUC, you want to resample into 12-hour chunks and then create dummies.
pd.cut is a perfectly acceptable way to cut the resulting data into bins.
However, I use np.searchsorted to accomplish the task.

bins = np.array([17, 18, 19, 20])
labels = np.array(['<17', '17-18', '18-19', '19-20', '>20'])
resampled = df.resample('12H').measurement.mean()
pd.get_dummies(pd.Series(labels[bins.searchsorted(resampled.values)], resampled.index))

                     17-18  18-19  19-20  >20
2018-03-20 00:00:00      0      1      0    0
2018-03-20 12:00:00      1      0      0    0
2018-03-21 00:00:00      0      1      0    0
2018-03-21 12:00:00      0      0      0    1
2018-03-22 00:00:00      0      0      1    0
2018-03-22 12:00:00      0      0      0    1
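
For readers unfamiliar with searchsorted, a small sketch of how it maps values to labels may help. The input values here are made up for illustration; bins and labels are the ones from the answer.

```python
import numpy as np

bins = np.array([17, 18, 19, 20])
labels = np.array(["<17", "17-18", "18-19", "19-20", ">20"])

# Made-up values, chosen to hit the edges and both open-ended labels.
values = np.array([16.5, 17.5, 18.0, 19.2, 22.9])

# With the default side="left", position i means bins[i-1] < value <= bins[i],
# so each bin is closed on the right, like pd.cut's default.
positions = bins.searchsorted(values)
binned = labels[positions]
print(positions)
print(binned)
```

Note that there is one more label than there are bin edges: positions 0 and len(bins) cover everything at or below the first edge and above the last edge, so no value is dropped the way an out-of-range value would be with pd.cut.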

Setup

import numpy as np
import pandas as pd

np.random.seed(int(np.pi * 1E6))

tidx = pd.date_range(pd.Timestamp('now'), freq='3H', periods=20)
df = pd.DataFrame(dict(measurement=np.random.rand(len(tidx)) * 6 + 17), tidx)

df

                            measurement
2018-03-20 06:58:30.484383    17.960744
2018-03-20 09:58:30.484383    18.572100
2018-03-20 12:58:30.484383    17.646766
2018-03-20 15:58:30.484383    19.025463
2018-03-20 18:58:30.484383    17.521399
2018-03-20 21:58:30.484383    17.318663
2018-03-21 00:58:30.484383    19.388553
2018-03-21 03:58:30.484383    19.520969
2018-03-21 06:58:30.484383    19.060640
2018-03-21 09:58:30.484383    17.106034
2018-03-21 12:58:30.484383    22.887546
2018-03-21 15:58:30.484383    18.437271
2018-03-21 18:58:30.484383    18.426362
2018-03-21 21:58:30.484383    20.558928
2018-03-22 00:58:30.484383    22.555121
2018-03-22 03:58:30.484383    17.139489
2018-03-22 06:58:30.484383    17.209499
2018-03-22 09:58:30.484383    19.466367
2018-03-22 12:58:30.484383    21.765692
2018-03-22 15:58:30.484383    19.680785

You can use pd.cut() + pd.get_dummies():

df["measurement"] = pd.cut(df["measurement"], bins=[17.0,18.0,19.0,20.0])
dummies = pd.get_dummies(df["measurement"])
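
When no labels are passed, pd.cut names each category after its interval, and values outside the bins become NaN. A minimal sketch of both effects, using made-up values:

```python
import pandas as pd

# Made-up values; the last one lies outside every bin.
s = pd.Series([17.5, 18.0, 19.9, 22.9])

# Without labels the categories are right-closed intervals such as (17.0, 18.0].
binned = pd.cut(s, bins=[17.0, 18.0, 19.0, 20.0])
dummies = pd.get_dummies(binned, dtype=int)

print(binned)
print(dummies)
```

The out-of-range value ends up as a row of zeros in the dummies, which is one way all-zero rows can appear in this kind of output.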

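For the first question: you can group by every 12 hours (or any other frequency) and then take the mean of each group. This answer originally used pandas.TimeGrouper, which has since been deprecated and removed from pandas; pd.Grouper is its replacement.

df.groupby(pd.Grouper(freq='12H')).mean()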
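
As a rough check, grouping with pd.Grouper and calling resample should give the same 12-hour means. The sketch below uses made-up, evenly spaced data; the names by_grouper and by_resample are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up readings every 3 hours: 17.0, 17.5, ..., 20.5.
idx = pd.date_range("2018-03-20 06:00", periods=8, freq="3h")
df = pd.DataFrame({"measurement": np.arange(17.0, 21.0, 0.5)}, index=idx)

# Two ways to take the mean over 12-hour windows anchored at midnight.
by_grouper = df.groupby(pd.Grouper(freq="12h")).mean()
by_resample = df.resample("12h").mean()

print(by_grouper)
print(by_resample)
```

pd.Grouper is handy when the timestamps live in a column (via key=...) or when grouping by time together with other keys; resample is the shorter spelling when the index is already a DatetimeIndex.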
