
Finding difference within grouped dataframe in python

I have this dataframe:

                           Value      ID
          Timestamp
-----------------------------------------
2018-07-03 02:19:28          45      111
2018-07-03 02:19:29          36      111
2018-07-03 02:19:30          64      111
2018-07-03 02:19:31          35      111
2018-07-03 02:19:32          22      111 
...            
2018-07-03 03:43:14          35      232 
2018-07-03 03:43:15          44      232
2018-07-03 03:43:16          64      232
2018-07-03 03:43:17          44      232
2018-07-03 03:43:18          64      232
...
2018-07-03 05:20:28          35      555
2018-07-03 05:21:28          44      555
2018-07-03 05:22:28          75      555 
2018-07-03 05:19:28          84      555
2018-07-03 05:19:28          35      555 
...

Here, each ID represents a different "subset" of the total dataset: ID 111 is its own time series dataset, 232 is its own time series dataset, and 555 is its own time series dataset, with many more not shown. What I want to do, using python, is find, for each of these data subsets, the number of peaks and valleys based on the values in the "Value" column, and then append that count to the original dataframe like so:

                          Value      ID       Curve_Changes
          Timestamp
------------------------------------------------------------
2018-07-03 02:19:28          45      111                  4
2018-07-03 02:19:29          36      111                  4
2018-07-03 02:19:30          64      111                  4
2018-07-03 02:19:31          35      111                  4
2018-07-03 02:19:32          22      111                  4  
...             
2018-07-03 03:43:14          35      232                  9    
2018-07-03 03:43:15          44      232                  9
2018-07-03 03:43:16          64      232                  9
2018-07-03 03:43:17          44      232                  9
2018-07-03 03:43:18          64      232                  9
...
2018-07-03 05:20:28          35      555                 12
2018-07-03 05:21:28          44      555                 12
2018-07-03 05:22:28          75      555                 12 
2018-07-03 05:19:28          84      555                 12
2018-07-03 05:19:28          35      555                 12 
...

Based on this ideal output, if you were to plot the time series data subset corresponding to ID 111, you would see 4 curve changes (each either a peak or a valley); if you were to plot the subset corresponding to ID 232, you would see 9 curve changes, and so on.

I am trying to use this code to find the number of peaks and valleys:

slopes = df["Value"].diff().bfill()
signs = slopes > 0
changes = signs.astype(float).diff(periods=-1).fillna(0)
num_changes = changes.abs().sum()

where num_changes is the number of curve changes I want. I am able to get this to work on the dataframe as a whole, but I am confused about how to make it work for each individual time series data subset so as to produce the ideal output dataframe shown above. I think this will be a .groupby() type task, where I need to group by the "ID" column, but I am not sure. How can I group the dataframe by the data subsets, find the number of curve changes for each subset, and match those back to the original dataframe?
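To make the four-line snippet concrete, here is a minimal sketch that traces it on the five rows shown for ID 111. The helper name `count_curve_changes` is an assumption made for illustration and does not appear in the original code. The result covers only these five sample rows, not the full subset, so it will not match the 4 shown in the ideal output.

```python
import pandas as pd


def count_curve_changes(values: pd.Series) -> int:
    """Count slope sign flips (peaks plus valleys) in one series.

    Hypothetical helper that wraps the four lines from the question.
    """
    # Difference between consecutive values; the leading NaN is back-filled.
    slopes = values.diff().bfill()
    # True where the series is rising, False where it is falling or flat.
    signs = slopes > 0
    # Compare each sign with the next row's sign; nonzero marks a flip.
    changes = signs.astype(float).diff(periods=-1).fillna(0)
    return int(changes.abs().sum())


# The five sample values shown for ID 111, in time order.
sample = pd.Series([45, 36, 64, 35, 22])
n_changes = count_curve_changes(sample)
print(n_changes)
```

On these five values the count is 2: a valley at 36 and a peak at 64. Note that a flat step (slope of exactly 0) is treated the same as a falling one, because the test is `slopes > 0`.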

Use GroupBy.transform here to apply the solution per group and write the result to a new column:

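def f(x):
    # x is the "Value" column of a single ID group
    # for debugging: show the group being processed
    print(x)
    # slope between consecutive values; back-fill the leading NaN
    slopes = x.diff().bfill()
    # for debugging: show the slopes
    print(slopes)
    # True where the series is rising
    signs = slopes > 0
    # nonzero wherever the sign differs from the next row's sign
    changes = signs.astype(float).diff(periods=-1).fillna(0)
    # total number of sign flips (peaks plus valleys) in this group
    return changes.abs().sum()

# transform broadcasts each group's scalar result back to every row of that group
df['Curve_Changes'] = df.groupby('ID')['Value'].transform(f)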
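As a rough end-to-end check, the sketch below rebuilds the rows shown in the question and applies the same GroupBy.transform call. Several things here are assumptions: the function is renamed `curve_changes` and the debug prints are dropped, the timestamps are omitted (the count depends only on row order within each ID, so the rows are assumed to already be in time order), and the result is cast to int to match the ideal output. Because only five rows per ID are available, the counts differ from the 4, 9, and 12 in the question, which refer to the full subsets.

```python
import pandas as pd


def curve_changes(values: pd.Series) -> float:
    """Number of slope sign flips (peaks plus valleys) in one group."""
    slopes = values.diff().bfill()
    signs = slopes > 0
    changes = signs.astype(float).diff(periods=-1).fillna(0)
    return changes.abs().sum()


# Only the rows visible in the question, not the full dataset.
df = pd.DataFrame(
    {
        "Value": [45, 36, 64, 35, 22,   # ID 111
                  35, 44, 64, 44, 64,   # ID 232
                  35, 44, 75, 84, 35],  # ID 555
        "ID": [111] * 5 + [232] * 5 + [555] * 5,
    }
)

# transform runs curve_changes once per ID and repeats the scalar
# result on every row that belongs to that ID.
df["Curve_Changes"] = (
    df.groupby("ID")["Value"].transform(curve_changes).astype(int)
)

# One count per ID, taken from the broadcast column.
per_id = df.groupby("ID")["Curve_Changes"].first().to_dict()
print(df)
```

For these sample rows the per-ID counts come out as 2 for ID 111, 2 for ID 232, and 1 for ID 555. If the rows were not already sorted by time within each ID, sorting them first (for example with `sort_index()` on a Timestamp index) would be needed before counting.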
