Creating categorical data from 2 columns - Python Pandas
I am having trouble creating a DataFrame that holds the time interval in which a temperature measurement falls. Currently the DataFrame has time as its index and a single column of measurements. I would like to convert the times into 12-hour intervals and have the measurement be the mean of the values within each interval.
measurement
time
2016-11-04 08:49:25 17.730000
2016-11-04 10:23:52 18.059999
2016-11-04 11:02:09 18.370001
2016-11-04 12:04:20 18.090000
2016-11-04 14:26:43 18.320000
So instead of having each timestamp paired with its measurement, I want the mean of the values over, say, 12 hours, like this:
measurement
time
2016-11-04 00:00:00 - 2016-11-04 12:00:00 17.730000
2016-11-04 12:00:00 - 2016-11-05 00:00:00 18.059999
2016-11-05 00:00:00 - 2016-11-05 12:00:00 18.370001
2016-11-05 12:00:00 - 2016-11-06 00:00:00 18.090000
2016-11-06 00:00:00 - 2016-11-06 12:00:00 18.320000
Is there an easy way to do this with pandas?
Later I would like to convert the measurements into intervals as well, so that the data becomes boolean, like this:
17.0-18.0 18.0-19.0 19.0-20
time
2016-11-04 00:00:00 - 2016-11-04 12:00:00 1 0 0
2016-11-04 12:00:00 - 2016-11-05 00:00:00 0 1 0
2016-11-05 00:00:00 - 2016-11-05 12:00:00 0 1 0
2016-11-05 12:00:00 - 2016-11-06 00:00:00 0 1 0
2016-11-06 00:00:00 - 2016-11-06 12:00:00 0 1 0
EDIT: I used a solution first posted by Coldspeed:
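import pandas as pd
# `time` and `readings` are the asker's existing Series
df = pd.DataFrame({'timestamp':time.values, 'readings':readings.values})
df = df.groupby(pd.Grouper(key='timestamp', freq='12H'))['readings'].mean()  # mean reading per 12-hour window
v = pd.cut(df, bins=[17,18,19,20,21,22,23,24,25,26,27,28], labels=['17-18','18-19','19-20','20-21','21-22','22-23','23-24','24-25','25-26','26-27','27-28'])
pd.get_dummies(v)  # one indicator column per bin, as in the result below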
I know the bins and labels could have been generated with a for loop, but this is just a quick fix. The groupby call groups the 'timestamp' values into 12-hour windows and takes the mean of the readings in each window.
Then the cut function is used to sort the means into their categories.
Result:
17-18 18-19 19-20 20-21 21-22 22-23 23-24 24-25 \
timestamp
2016-11-04 00:00:00 0 1 0 0 0 0 0 0
2016-11-04 12:00:00 0 1 0 0 0 0 0 0
2016-11-05 00:00:00 0 0 0 0 0 0 0 0
2016-11-05 12:00:00 1 0 0 0 0 0 0 0
2016-11-06 00:00:00 1 0 0 0 0 0 0 0
2016-11-06 12:00:00 0 0 0 0 0 0 0 0
2016-11-07 00:00:00 0 1 0 0 0 0 0 0
2016-11-07 12:00:00 1 0 0 0 0 0 0 0
2016-11-08 00:00:00 0 0 0 0 0 0 0 0
2016-11-08 12:00:00 0 0 0 0 0 0 0 0
2016-11-09 00:00:00 1 0 0 0 0 0 0 0
2016-11-09 12:00:00 1 0 0 0 0 0 0 0
2016-11-10 00:00:00 0 1 0 0 0 0 0 0
2016-11-10 12:00:00 0 0 0 0 0 0 0 0
2016-11-11 00:00:00 0 0 0 0 0 0 0 0
2016-11-11 12:00:00 0 0 0 0 0 0 0 0
2016-11-12 00:00:00 0 0 0 0 0 0 0 0
2016-11-12 12:00:00 0 0 0 0 0 0 0 0
2016-11-13 00:00:00 0 0 0 0 0 0 0 0
2016-11-13 12:00:00 0 0 0 0 0 0 0 0
2016-11-14 00:00:00 0 0 0 0 0 0 0 0
2016-11-14 12:00:00 0 1 0 0 0 0 0 0
2016-11-15 00:00:00 0 0 0 1 0 0 0 0
2016-11-15 12:00:00 0 0 0 0 0 1 0 0
2016-11-16 00:00:00 0 0 0 0 0 0 1 0
2016-11-16 12:00:00 0 0 0 0 0 0 0 0
2016-11-17 00:00:00 0 0 0 0 0 0 0 0
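The remark about building the bins and labels in a loop could be sketched as follows. This is only an illustration: the `means` Series holds made-up values standing in for the real 12-hour means, and the names `bins`, `labels`, `categories` and `dummies` are chosen here, not taken from the thread.

```python
import pandas as pd

# Hypothetical 12-hour means standing in for the thread's real data;
# the NaN mimics a window that contained no readings.
means = pd.Series(
    [18.4, 18.1, float("nan"), 17.6, 22.7],
    index=pd.date_range("2016-11-04", periods=5, freq="12h"),
)

# Edges 17, 18, ..., 28 and the labels '17-18', ..., '27-28' derived from them.
bins = list(range(17, 29))
labels = [f"{lo}-{hi}" for lo, hi in zip(bins[:-1], bins[1:])]

# Bin the means, then expand the categories into 0/1 indicator columns.
categories = pd.cut(means, bins=bins, labels=labels)
dummies = pd.get_dummies(categories, dtype=int)
```

Rows that are all zeros, like several in the result above, are what this produces when a window's mean is NaN or falls outside the bins: pd.cut returns NaN there, and pd.get_dummies emits no indicator for NaN by default.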
Use pd.cut + pd.get_dummies:
v = pd.cut(df.measurement, bins=[17, 18, 19, 20], labels=['17-18', '18-19', '19-20'])
pd.get_dummies(v)
17-18 18-19 19-20
0 1 0 0
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
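One detail worth checking with this approach is how values on a bin edge are treated. By default pd.cut uses right-closed intervals, so 18.0 lands in '17-18' and 17.0 itself falls outside the first bin. A small sketch with made-up values (the Series `values` is an assumption for illustration):

```python
import pandas as pd

values = pd.Series([17.0, 17.73, 18.0, 18.06, 20.5])
bins = [17, 18, 19, 20]
labels = ["17-18", "18-19", "19-20"]

# Default: intervals are (17, 18], (18, 19], (19, 20].
default_cut = pd.cut(values, bins=bins, labels=labels)

# right=False: intervals are [17, 18), [18, 19), [19, 20).
left_closed = pd.cut(values, bins=bins, labels=labels, right=False)
```

With the default, 17.0 and 20.5 come back as NaN; passing include_lowest=True would pull 17.0 into the first bin if that is the behaviour you want.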
IIUC you want to resample by 12 hour chunks, then create dummies.
pd.cut is a perfectly acceptable way to cut the resultant data into bins.
However, I use np.searchsorted to accomplish the task.
bins = np.array([17, 18, 19, 20])
labels = np.array(['<17', '17-18', '18-19', '19-20', '>20'])
resampled = df.resample('12H').measurement.mean()
pd.get_dummies(pd.Series(labels[bins.searchsorted(resampled.values)], resampled.index))
17-18 18-19 19-20 >20
2018-03-20 00:00:00 0 1 0 0
2018-03-20 12:00:00 1 0 0 0
2018-03-21 00:00:00 0 1 0 0
2018-03-21 12:00:00 0 0 0 1
2018-03-22 00:00:00 0 0 1 0
2018-03-22 12:00:00 0 0 0 1
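To see why indexing `labels` with the searchsorted result works, here is a small sketch on a handful of made-up values (the `values` array is an assumption, not data from the thread). searchsorted returns, for each value, the position at which it would be inserted into `bins`, and that position doubles as the index of the matching label.

```python
import numpy as np

bins = np.array([17, 18, 19, 20])
labels = np.array(["<17", "17-18", "18-19", "19-20", ">20"])

values = np.array([16.2, 17.5, 18.0, 19.9, 22.9])

# Default side='left': a value equal to an edge goes to the lower bin,
# which matches pd.cut's right-closed intervals.
positions = bins.searchsorted(values)
binned = labels[positions]
```

Here `positions` is `[0, 1, 1, 3, 4]`, so 18.0 is labelled '17-18'. One caveat, as far as I know: NaN sorts after every number, so an empty 12-hour window (whose mean is NaN) would end up labelled '>20' rather than being left out, which may or may not be what you want.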
Setup
np.random.seed(int(np.pi * 1E6))
tidx = pd.date_range(pd.Timestamp('now'), freq='3H', periods=20)
df = pd.DataFrame(dict(measurement=np.random.rand(len(tidx)) * 6 + 17), tidx)
df
measurement
2018-03-20 06:58:30.484383 17.960744
2018-03-20 09:58:30.484383 18.572100
2018-03-20 12:58:30.484383 17.646766
2018-03-20 15:58:30.484383 19.025463
2018-03-20 18:58:30.484383 17.521399
2018-03-20 21:58:30.484383 17.318663
2018-03-21 00:58:30.484383 19.388553
2018-03-21 03:58:30.484383 19.520969
2018-03-21 06:58:30.484383 19.060640
2018-03-21 09:58:30.484383 17.106034
2018-03-21 12:58:30.484383 22.887546
2018-03-21 15:58:30.484383 18.437271
2018-03-21 18:58:30.484383 18.426362
2018-03-21 21:58:30.484383 20.558928
2018-03-22 00:58:30.484383 22.555121
2018-03-22 03:58:30.484383 17.139489
2018-03-22 06:58:30.484383 17.209499
2018-03-22 09:58:30.484383 19.466367
2018-03-22 12:58:30.484383 21.765692
2018-03-22 15:58:30.484383 19.680785
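Because the setup starts the index at pd.Timestamp('now'), the timestamps differ on every run. A variant that pins the start to the first timestamp printed above may be easier to reproduce; this is a sketch under that assumption, and it uses lowercase '3h' because the uppercase 'H' alias is deprecated in recent pandas releases.

```python
import numpy as np
import pandas as pd

np.random.seed(int(np.pi * 1e6))

# Fixed start copied from the first row of the printed frame.
tidx = pd.date_range("2018-03-20 06:58:30.484383", freq="3h", periods=20)

# Uniform values in [17, 23), as in the original setup.
df = pd.DataFrame({"measurement": np.random.rand(len(tidx)) * 6 + 17}, index=tidx)
```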
You can use pd.cut() + pd.get_dummies():
df["measurement"] = pd.cut(df["measurement"], bins=[17.0,18.0,19.0,20.0])
dummies = pd.get_dummies(df["measurement"])
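For the first question: you can use pandas.TimeGrouper to group by every 12 hours (or any other frequency) and then take the mean of each group. Note that pd.TimeGrouper was removed in pandas 1.0; pd.Grouper is its replacement.
df.groupby(pd.Grouper(freq='12H')).mean()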
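Grouping this way labels each window by its start time only. If the "start - end" labels from the question are wanted, one possible sketch is below, using the five sample readings from the question. The names `width` and `fmt` are illustrative, and resample with lowercase '12h' is used here as an equivalent of the groupby above.

```python
import pandas as pd

# The five sample readings shown in the question.
df = pd.DataFrame(
    {"measurement": [17.73, 18.059999, 18.370001, 18.09, 18.32]},
    index=pd.to_datetime([
        "2016-11-04 08:49:25",
        "2016-11-04 10:23:52",
        "2016-11-04 11:02:09",
        "2016-11-04 12:04:20",
        "2016-11-04 14:26:43",
    ]),
)
df.index.name = "time"

# Mean per 12-hour window; the resulting index holds each window's start.
means = df["measurement"].resample("12h").mean()

# Replace each start timestamp with a 'start - end' string label.
width = pd.Timedelta(hours=12)
fmt = "%Y-%m-%d %H:%M:%S"
means.index = [
    f"{start.strftime(fmt)} - {(start + width).strftime(fmt)}"
    for start in means.index
]
```

The labels are plain strings, so they are fine for display but no longer support time-based slicing; keeping the original DatetimeIndex alongside them is probably wise if further time arithmetic is needed.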