Resampling with start and end date columns

I have a dataframe which looks like the following:

 START_TIME   END_TIME     TRIAL_No        itemnr
 2403950      2413067      Trial: 1        P14
 2413378      2422499      Trial: 2        P03
 2422814      2431931      Trial: 3        P13
 2432246      2441363      Trial: 4        P02
 2523540      2541257      Trial: 5        P11
 2541864      2560297      Trial: 6        P10
 2560916      2577249      Trial: 7        P05

The table goes on like that. START_TIME and END_TIME are in milliseconds and mark the start and end of each trial. I want to resample START_TIME into 100-millisecond bins and fill in the variables (TRIAL_No and itemnr) for every bin between each START_TIME and END_TIME. Outside of these regions, the variables should have the value "NA". For example, in the first row START_TIME is 2403950 and END_TIME is 2413067, a difference of 9117 milliseconds. So "Trial: 1" lasts for 9117 ms, which is about 91 bins at 100 ms per bin, and I want "Trial: 1" and "P14" repeated about 91 times in the resulting dataframe. The same goes for the rest. The result should look like the following:

Bin_time     TRIAL_No    itemnr
2403950      Trial: 1    P14
2404050      Trial: 1    P14
2404150      Trial: 1    P14
            ...
2413050      Trial: 1    P14
2413150      Trial: 2    P03
2413250      Trial: 2    P03

and so on. I am not sure whether this is possible directly in pandas or whether some preprocessing is needed.
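For reference, the bin arithmetic for the first trial can be checked with plain Python. This is only a sketch of the counting described above; the variable names are made up for illustration. It assumes bins are labelled from START_TIME in 100 ms steps, as in the expected output.

```python
start, end, step = 2403950, 2413067, 100

duration = end - start                          # length of the trial in ms
full_steps = duration // step                   # complete 100 ms steps inside the trial
bin_times = list(range(start, end + 1, step))   # bin labels from start up to end

print(duration, full_steps, len(bin_times), bin_times[-1])
# 9117 91 92 2413050
```

So there are 91 complete steps, and 92 bin labels if the bin that begins at START_TIME itself is counted; the last label, 2413050, matches the expected output.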

After building a new dataframe with concat, in which each trial appears once with its start time and once with its end time, I can group it by the original row and apply resample to each group (forward filling with ffill).

import numpy as np
import pandas as pd

print(df)
#   START_TIME  END_TIME  TRIAL_No itemnr
#0     2403950   2413067  Trial: 1    P14
#1     2413378   2422499  Trial: 2    P03
#2     2422814   2431931  Trial: 3    P13
#3     2432246   2441363  Trial: 4    P02
#4     2523540   2541257  Trial: 5    P11
#5     2541864   2560297  Trial: 6    P10
#6     2560916   2577249  Trial: 7    P05

# PREPROCESSING
# helper column for matching start and end rows
df['row'] = range(len(df))

# reshape: each original row appears twice, once with START_TIME and once with END_TIME as Bin_time
starts = df[['START_TIME','TRIAL_No','itemnr','row']].rename(columns={'START_TIME':'Bin_time'})
ends = df[['END_TIME','TRIAL_No','itemnr','row']].rename(columns={'END_TIME':'Bin_time'})
df = pd.concat([starts, ends])
df = df.set_index('row', drop=True)
df = df.sort_index()

# convert milliseconds to timedelta so the data can be resampled in 100 ms steps
df['Bin_time'] = pd.to_timedelta(df['Bin_time'], unit='ms')
print(df)
#                  Bin_time  TRIAL_No itemnr
#row                                        
#0   0 days 00:40:03.950000  Trial: 1    P14
#0   0 days 00:40:13.067000  Trial: 1    P14
#1   0 days 00:40:13.378000  Trial: 2    P03
#1   0 days 00:40:22.499000  Trial: 2    P03
#2   0 days 00:40:22.814000  Trial: 3    P13
#2   0 days 00:40:31.931000  Trial: 3    P13
#3   0 days 00:40:32.246000  Trial: 4    P02
#3   0 days 00:40:41.363000  Trial: 4    P02
#4   0 days 00:42:03.540000  Trial: 5    P11
#4   0 days 00:42:21.257000  Trial: 5    P11
#5   0 days 00:42:21.864000  Trial: 6    P10
#5   0 days 00:42:40.297000  Trial: 6    P10
#6   0 days 00:42:40.916000  Trial: 7    P05
#6   0 days 00:42:57.249000  Trial: 7    P05

print(df.dtypes)
#Bin_time    timedelta64[ns]
#TRIAL_No             object
#itemnr               object
#dtype: object

# resample each trial to 100 ms bins and forward fill the missing rows
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())

df = df.reset_index()
df = df.drop(['row'], axis=1)

# convert timedelta back to integer milliseconds
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)

print(df.head())
#  Bin_time  TRIAL_No itemnr
#0  2403950  Trial: 1    P14
#1  2404050  Trial: 1    P14
#2  2404150  Trial: 1    P14
#3  2404250  Trial: 1    P14
#4  2404350  Trial: 1    P14
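To see what the per-group resample does to a single trial, here is a minimal sketch using only the first row of the table. The names trial, sparse and filled are made up for illustration, and the chained resample(...).first().ffill() form assumes a reasonably recent pandas.

```python
import pandas as pd

# One trial as two rows: its start and its end (values from the first row of the table).
trial = pd.DataFrame({
    "Bin_time": pd.to_timedelta([2403950, 2413067], unit="ms"),
    "TRIAL_No": ["Trial: 1", "Trial: 1"],
    "itemnr": ["P14", "P14"],
}).set_index("Bin_time")

# After first(), only the bins that contain the start and end rows hold values.
sparse = trial.resample("100ms").first()
# ffill() carries the start row forward through the empty bins in between.
filled = sparse.ffill()

print(sparse["TRIAL_No"].notna().sum(), "of", len(sparse), "bins filled before ffill")
print(filled["TRIAL_No"].notna().sum(), "of", len(filled), "bins filled after ffill")
```

Because every group contains exactly its own start and end row, the forward fill never leaks values from one trial into the next.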

EDIT:

If you want NaN outside of the trials, change the code after the groupby:

# resample each trial to 100 ms bins and forward fill the missing rows
df = df.groupby(df.index).apply(lambda x: x.set_index('Bin_time').resample('100ms').first().ffill())

# reset only the first index level - drops the helper index 'row'
df = df.reset_index(level=0, drop=True)
# resample the whole frame by 100 ms; bins outside the trials become NaN
df = df.resample('100ms').first()
df = df.reset_index()
# convert timedelta back to integer milliseconds
df['Bin_time'] = (df['Bin_time'] / np.timedelta64(1, 'ms')).astype(int)

print(df)
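A self-contained sketch of this NaN variant, restricted to trials 4 and 5 from the table because they are separated by a gap of roughly 82 seconds. The names long, per_trial and out are made up for illustration, and the code assumes a pandas version where resample(...).first() is the supported spelling.

```python
import numpy as np
import pandas as pd

# Trials 4 and 5 from the table; the time between them belongs to no trial.
df = pd.DataFrame({
    "START_TIME": [2432246, 2523540],
    "END_TIME": [2441363, 2541257],
    "TRIAL_No": ["Trial: 4", "Trial: 5"],
    "itemnr": ["P02", "P11"],
})
df["row"] = range(len(df))

# Each trial appears twice: once at its start time, once at its end time.
cols = ["TRIAL_No", "itemnr", "row"]
starts = df[["START_TIME"] + cols].rename(columns={"START_TIME": "Bin_time"})
ends = df[["END_TIME"] + cols].rename(columns={"END_TIME": "Bin_time"})
long = pd.concat([starts, ends]).set_index("row").sort_index()
long["Bin_time"] = pd.to_timedelta(long["Bin_time"], unit="ms")

# Fill each trial separately, then resample once more over the whole time axis.
per_trial = long.groupby(level="row").apply(
    lambda g: g.set_index("Bin_time").resample("100ms").first().ffill()
)
out = per_trial.reset_index(level=0, drop=True).resample("100ms").first().reset_index()
out["Bin_time"] = (out["Bin_time"] / np.timedelta64(1, "ms")).astype(int)

print(out["TRIAL_No"].notna().sum(), "bins inside a trial")
print(out["TRIAL_No"].isna().sum(), "bins outside any trial")
```

One thing to be aware of: the second resample puts every bin on a single 100 ms grid that starts at the earliest Bin_time, so the labels of later trials can shift by up to 99 ms relative to their own START_TIME. If the labels must start exactly at each START_TIME, keep the per-trial result instead.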
