Convert pandas df with data in a "list column" into a time series in long format. Use three columns: [list of data] + [timestamp] + [duration]

The aim is to convert a dataframe that has a list column as its data column (and thus just one timestamp and one duration per row) into a time series in long format with a datetimeindex for each single item.

In the result, there is no longer a sequence/list per row for the data; there is just one value column.

import pandas as pd

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],
                        'nestedList': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                        'duration_sec': [3.0, 3.0, 3.0, 3.0]})

tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

Out:

                       nestedList  duration_sec
datetimeindex
2016-05-04 08:53:20  [1, 2, 1, 9]           3.0
2016-05-04 08:55:00  [2, 2, 3, 0]           3.0
2016-05-04 08:56:40  [1, 3, 3, 0]           3.0
2016-05-04 08:58:20  [1, 1, 3, 9]           3.0

The aim is:

                   value
datetimeindex
2016-05-04 08:53:20  1
2016-05-04 08:53:21  2
2016-05-04 08:53:22  1
2016-05-04 08:53:23  9
2016-05-04 08:55:00  2
2016-05-04 08:55:01  2
2016-05-04 08:55:02  3
2016-05-04 08:55:03  0
2016-05-04 08:56:40  1
2016-05-04 08:56:41  3
2016-05-04 08:56:42  3
2016-05-04 08:56:43  0
2016-05-04 08:58:20  1
2016-05-04 08:58:21  1
2016-05-04 08:58:22  3
2016-05-04 08:58:23  9

Mind that this does not mean simply taking 1 second for each item; that value was only chosen to simplify the example. Instead, each row is a sequence of 4 items with a given duration of, for example, 3.0 seconds (which may also vary from row to row), and the first item of each sequence always starts at "time 0", meaning that the seconds per item should be calculated like

[3.0 sec / (4-1) items] = 1 sec.
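
As a minimal sketch of that calculation (the names n_items and step_sec are only illustrative, not taken from the question):

n_items = 4                              # items in one row's list
duration_sec = 3.0                       # duration of that row
step_sec = duration_sec / (n_items - 1)  # n items span (n-1) intervals -> 1.0 sec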

Context:

The example shows the conversion to a DatetimeIndex since this makes it suitable for seasonal_decompose(); see this first search hit.

There, the resulting df looks like this:

df_test2 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

Out:

                value
date                 
1991-07-01   3.526591
1991-08-01   3.180891
1991-09-01   3.252221
1991-10-01   3.611003
1991-11-01   3.565869
              ...
2008-02-01  21.654285
2008-03-01  18.264945
2008-04-01  23.107677
2008-05-01  22.912510
2008-06-01  19.431740

[204 rows x 1 columns]

And then it is easy to apply seasonal_decompose() with an additive decomposition model:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

result_add = seasonal_decompose(df_test2['value'], model='additive', extrapolate_trend='freq')

# Plot
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[Plot: output of result_add.plot(), the additive decomposition of the a10 series]

Now the same is needed for the df_test above.

Use DataFrame.explode first, then build a counter with GroupBy.cumcount, convert it with to_timedelta, and add it to df.index:

# one row per list item; each original timestamp is repeated for its items
df_test = df_test.explode('nestedList')
# add a per-timestamp counter (0, 1, 2, ...) as seconds to the index
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='s')

print (df_test)
                    nestedList  duration_sec
2016-05-04 08:53:20          1           3.0
2016-05-04 08:53:21          2           3.0
2016-05-04 08:53:22          1           3.0
2016-05-04 08:53:23          9           3.0
2016-05-04 08:55:00          2           3.0
2016-05-04 08:55:01          2           3.0
2016-05-04 08:55:02          3           3.0
2016-05-04 08:55:03          0           3.0
2016-05-04 08:56:40          1           3.0
2016-05-04 08:56:41          3           3.0
2016-05-04 08:56:42          3           3.0
2016-05-04 08:56:43          0           3.0
2016-05-04 08:58:20          1           3.0
2016-05-04 08:58:21          1           3.0
2016-05-04 08:58:22          3           3.0
2016-05-04 08:58:23          9           3.0
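
To match the single value column from the target layout above, an optional tidy-up could look like this (df_long is just an illustrative name, not part of the original answer):

# keep a single data column named 'value', as in the target layout in the question
df_long = (df_test.rename(columns={'nestedList': 'value'})
                  .drop(columns='duration_sec'))
df_long.index.name = 'datetimeindex'
print(df_long)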

EDIT (derive the spacing per item from duration_sec instead of assuming a fixed 1 second):

# run this on the original df_test (before the first explode above)
df_test = df_test.explode('nestedList')
# each original row has n items, i.e. (n - 1) intervals
sizes = df_test.groupby(level=0)['nestedList'].transform('size').sub(1)
# per-item step in seconds for each row: duration_sec / (n - 1)
duration = df_test['duration_sec'].div(sizes)
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s')
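
With the example values this reproduces the 1-second spacing shown above, because 3.0 sec / (4-1) items = 1.0 sec per item; a quick check (my addition, not part of the original answer):

print(df_test.head(8))  # the first two original rows, again spaced 1 second apart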

EDIT2 by asker:

With the resulting df, this simple application of seasonal_decompose() is now possible, which was the final aim:

result_add = seasonal_decompose(x=df_test['nestedList'], model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[Plot of the simple application above, pasted by the asker]
