Apply using previous data with a pandas Series

Question

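I hope you can help me. I receive a series of signals 30 seconds apart. My problem is that within each "burst", the signals' timestamps differ by some milliseconds. For example: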

Burst    Timestamp
1    2018-02-14 09:50:46.752
1    2018-02-14 09:50:46.818
1    2018-02-14 09:50:47.030
1    2018-02-14 09:50:46.990
1    2018-02-14 09:50:46.828
1    2018-02-14 09:50:47.989
1    2018-02-14 09:50:47.937
2    2018-02-14 09:51:40.794
2    2018-02-14 09:51:40.985
2    2018-02-14 09:51:41.014
2    2018-02-14 09:51:41.043
2    2018-02-14 09:51:41.928
2    2018-02-14 09:51:42.002
3    2018-02-14 09:55:35.788
3    2018-02-14 09:55:35.823
3    2018-02-14 09:55:36.092
3    2018-02-14 09:55:35.997
3    2018-02-14 09:55:36.018
3    2018-02-14 09:55:36.115
3    2018-02-14 09:55:35.918

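I don't have the "Burst" column. What I want is either to assign the same timestamp to all signals of the same burst, or a way to obtain the "Burst" column, so that I can use the .pivot() method. I have more than 10 million entries in the database. Right now I do this with a for loop, but it takes more than 10 hours to finish. I think it could be done with apply and a lambda function on the Series, but the function would need to use two elements of the Series. The solution I use now: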

def group_by_date(response=pd.DataFrame, secs=2):
    response = response.sort_values(by='timeSource')
    for i in range(1,len(response)):
        if response.timeSource[i] - response.timeSource[i-1]<datetime.timedelta(0,secs):
            response.loc[i,"timeSource"]= response.loc[i-1,"timeSource"]

    return response

Note: timeSource is the same column as Timestamp. This is driving me crazy, so any help is welcome.

Thanks in advance!!! :)

Answer 1

Just to understand your code better:

import datetime

import pandas as pd

# A class as the default value for response is rather meaningless;
# in Python 3.6+ you can use type annotations instead.
# def group_by_date(response=pd.DataFrame, secs=2):
def group_by_date(response: pd.DataFrame, secs: int = 2):
    # Is timeSource the same column as Timestamp, or a different one?
    response = response.sort_values(by='timeSource')
    for i in range(1, len(response)):
        # 'de' means 'of'?
        print(str(i) + " de " + str(len(response)))
        # The posted code compared against an undefined name, segundos;
        # presumably secs was meant.
        if response.timeSource[i] - response.timeSource[i-1] < datetime.timedelta(0, secs):
            # Are you replacing the newer value with the older one?
            response.loc[i, "timeSource"] = response.loc[i-1, "timeSource"]
    return response

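Basically, I think you can create a second series that is a shifted copy of the first and apply a function to the two series to get the bursts. This may amount to just as much looping, but you can hope it runs faster inside pandas. Also, the print statement eats some of your CPU time, so consider removing it.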
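To make the shift idea concrete, here is a minimal sketch on a toy series. The three timestamps and the names `s`, `prev` and `gap` are chosen only for illustration. `Series.shift(1)` lines each element up with its predecessor, so the pairwise difference is computed in one vectorized step rather than in a Python loop.

```python
import pandas as pd

# Toy timestamps, used only to illustrate the mechanics.
s = pd.Series(pd.to_datetime([
    "2018-02-14 09:50:46.752",
    "2018-02-14 09:50:46.818",
    "2018-02-14 09:51:40.794",
]))

prev = s.shift(1)  # position i now holds the original element i-1 (NaT at i=0)
gap = s - prev     # difference between each element and its predecessor

# Series.diff() performs the same subtraction in a single call.
print(gap)
```

The first entry of `gap` is NaT because the first element has no predecessor; comparing NaT against a threshold yields False, so the first row never starts a new group by itself.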

Before solving this, I need some clarification on the motivating code.

Update: the code below should work

import datetime
import io

import pandas as pd

doc = """1    2018-02-14 09:50:46.752
1    2018-02-14 09:50:46.818
1    2018-02-14 09:50:47.030
1    2018-02-14 09:50:46.990
1    2018-02-14 09:50:46.828
1    2018-02-14 09:50:47.989
1    2018-02-14 09:50:47.937
2    2018-02-14 09:51:40.794
2    2018-02-14 09:51:40.985
2    2018-02-14 09:51:41.014
2    2018-02-14 09:51:41.043
2    2018-02-14 09:51:41.928
2    2018-02-14 09:51:42.002
3    2018-02-14 09:55:35.788
3    2018-02-14 09:55:35.823
3    2018-02-14 09:55:36.092
3    2018-02-14 09:55:35.997
3    2018-02-14 09:55:36.018
3    2018-02-14 09:55:36.115
3    2018-02-14 09:55:35.918"""

# For a real file, switch to a faster import
df = pd.read_csv(io.StringIO(doc),
                 header=None,
                 sep='    ',
                 engine='python',
                 names=["burst", "time"],
                 converters={'burst': int, 'time': pd.Timestamp})
# Drop the known burst labels, they are rebuilt below;
# copy() avoids a SettingWithCopyWarning when adding columns
df = df[['time']].copy()
# Find bursts
df['lag'] = df['time'].shift(1)
threshold = datetime.timedelta(seconds=2)
df['delta'] = df['time'] - df['lag']
df['is_new_group'] = df['delta'] > threshold
df['burst'] = df['is_new_group'].cumsum()
# One-liner that skips saving the intermediate columns
df['burst2'] = ((df['time'] - df['time'].shift(1)) > threshold).cumsum()
assert (df['burst2'] == df['burst']).all()


# Maybe the next thing you want is:
group = df.groupby('burst')
# Assuming the timestamps are sorted chronologically
res1 = group.first()['time']
res2 = group.count()['time']
# Again as a one-liner
res = group['time'].agg(['first', 'count'])
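The question also asked for every signal in a burst to carry the same timestamp. One way to get there, sketched under the assumption that the earliest timestamp of each burst is the one to keep, is `groupby(...).transform('min')`, which broadcasts a per-group value back onto every row. The sample is a shortened, reordered subset of the data above, and the column name `burst_time` is hypothetical. Sorting first matters because the sample timestamps are not strictly chronological within a burst.

```python
import pandas as pd

# Shortened, reordered subset of the sample data.
df = pd.DataFrame({"time": pd.to_datetime([
    "2018-02-14 09:50:46.818",
    "2018-02-14 09:50:46.752",
    "2018-02-14 09:50:47.989",
    "2018-02-14 09:51:40.794",
    "2018-02-14 09:51:41.014",
])})

threshold = pd.Timedelta(seconds=2)

# Sort so that neighbouring rows are also neighbours in time.
df = df.sort_values("time").reset_index(drop=True)

# A gap larger than the threshold starts a new burst; cumsum numbers them 0, 1, ...
df["burst"] = (df["time"].diff() > threshold).cumsum()

# Broadcast each burst's earliest timestamp to all of its rows.
df["burst_time"] = df.groupby("burst")["time"].transform("min")

print(df)
```

With `burst` or `burst_time` in place, the frame has a shared key per burst, which is what a later pivot would need. If a different representative is preferred, for example the latest timestamp, `"max"` can be passed to `transform` instead.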

Answer 1
0 Accepted 2018-08-16 11:31:10
