pandas time interval to time series
How can I convert time-interval data into time-series data using Python (and pandas)?
Here is my current DataFrame, stored as time intervals:
code start_dt end_dt ent_value
156600 1960-01-01 2016-04-21 H:CXP
156600 1960-01-01 2016-01-03 46927
156600 1998-08-31 2016-01-03 5516751
156600 1960-01-01 1998-08-30 4501242
For each combination of code and ent_value, we want the frame to contain one row for every day between that combination's start and end dates (as a time series):
code as_of_dt ent_value
156600 1960-01-01 H:CXP
156600 1960-01-02 H:CXP
156600 1960-01-03 H:CXP
156600 1960-01-01 46927
156600 1960-01-02 46927
156600 1960-01-03 46927
156600 1960-01-01 5516751
156600 1960-01-02 5516751
156600 1960-01-03 5516751
...
156600 2016-01-01 H:CXP
156600 2016-01-02 H:CXP
156600 2016-01-03 H:CXP
156600 2016-01-01 46927
156600 2016-01-02 46927
156600 2016-01-03 46927
156600 2016-01-01 5516751
156600 2016-01-02 5516751
156600 2016-01-03 5516751
How can I do this efficiently?
Here is one possible solution.
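import pandas as pd

# Read the tab-separated interval data
data = pd.read_csv('/tmp/test.tab', sep='\t')
# One (code, daily date range, ent_value) tuple per input row
tmp = [(e.code, pd.date_range(e.start_dt, e.end_dt, freq='1D'), e.ent_value)
       for e in data.itertuples()]
# Flatten: the outer loop must come first so that `line` is defined
res = [(line[0], date, line[2]) for line in tmp for date in line[1]]
df = pd.DataFrame(res, columns=['code', 'as_of_dt', 'ent_value'])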
The function pd.date_range() is used to create the date range for each row.
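As a small illustration of what pd.date_range() returns (a sketch only; the two dates are borrowed from the smaller test frame further down and are otherwise arbitrary), note that both endpoints are included in the range:

```python
import pandas as pd

# Daily range between two dates; both endpoints are included by default.
days = pd.date_range('1960-01-01', '1960-01-04', freq='D')
print(len(days))        # 4
print(days[0].date())   # 1960-01-01
print(days[-1].date())  # 1960-01-04
```

Because the end date is included, an interval whose start and end fall on the same day still produces one row.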
Try this:
In [17]: %paste
(df.groupby(['code','ent_value'])
.apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
.reset_index()
.drop(columns='level_2')
)
## -- End pasted text --
Out[17]:
code ent_value as_of_dt
0 156600 4501242 1960-01-01
1 156600 4501242 1960-01-02
2 156600 4501242 1960-01-03
3 156600 4501242 1960-01-04
4 156600 4501242 1960-01-05
5 156600 4501242 1960-01-06
6 156600 4501242 1960-01-07
7 156600 4501242 1960-01-08
8 156600 4501242 1960-01-09
9 156600 4501242 1960-01-10
10 156600 4501242 1960-01-11
11 156600 4501242 1960-01-12
12 156600 4501242 1960-01-13
13 156600 4501242 1960-01-14
14 156600 4501242 1960-01-15
15 156600 4501242 1960-01-16
16 156600 4501242 1960-01-17
17 156600 4501242 1960-01-18
18 156600 4501242 1960-01-19
19 156600 4501242 1960-01-20
20 156600 4501242 1960-01-21
21 156600 4501242 1960-01-22
22 156600 4501242 1960-01-23
23 156600 4501242 1960-01-24
24 156600 4501242 1960-01-25
25 156600 4501242 1960-01-26
26 156600 4501242 1960-01-27
27 156600 4501242 1960-01-28
28 156600 4501242 1960-01-29
29 156600 4501242 1960-01-30
... ... ... ...
61450 156600 H:CXP 2016-03-23
61451 156600 H:CXP 2016-03-24
61452 156600 H:CXP 2016-03-25
61453 156600 H:CXP 2016-03-26
61454 156600 H:CXP 2016-03-27
61455 156600 H:CXP 2016-03-28
61456 156600 H:CXP 2016-03-29
61457 156600 H:CXP 2016-03-30
61458 156600 H:CXP 2016-03-31
61459 156600 H:CXP 2016-04-01
61460 156600 H:CXP 2016-04-02
61461 156600 H:CXP 2016-04-03
61462 156600 H:CXP 2016-04-04
61463 156600 H:CXP 2016-04-05
61464 156600 H:CXP 2016-04-06
61465 156600 H:CXP 2016-04-07
61466 156600 H:CXP 2016-04-08
61467 156600 H:CXP 2016-04-09
61468 156600 H:CXP 2016-04-10
61469 156600 H:CXP 2016-04-11
61470 156600 H:CXP 2016-04-12
61471 156600 H:CXP 2016-04-13
61472 156600 H:CXP 2016-04-14
61473 156600 H:CXP 2016-04-15
61474 156600 H:CXP 2016-04-16
61475 156600 H:CXP 2016-04-17
61476 156600 H:CXP 2016-04-18
61477 156600 H:CXP 2016-04-19
61478 156600 H:CXP 2016-04-20
61479 156600 H:CXP 2016-04-21
[61480 rows x 3 columns]
Test DataFrame with smaller date ranges:
In [19]: df
Out[19]:
code start_dt end_dt ent_value
0 156600 1960-01-01 1960-01-04 H:CXP
1 156600 1960-01-04 1960-01-09 46927
2 156600 1998-08-31 1998-09-04 5516751
3 156600 1965-01-01 1965-01-04 4501242
In [20]: (df.groupby(['code','ent_value'])
....: .apply(lambda x: pd.DataFrame({'as_of_dt':pd.date_range(x.start_dt.min(), x.end_dt.max())}))
....: .reset_index()
....: .drop(columns='level_2')
....: )
Out[20]:
code ent_value as_of_dt
0 156600 4501242 1965-01-01
1 156600 4501242 1965-01-02
2 156600 4501242 1965-01-03
3 156600 4501242 1965-01-04
4 156600 46927 1960-01-04
5 156600 46927 1960-01-05
6 156600 46927 1960-01-06
7 156600 46927 1960-01-07
8 156600 46927 1960-01-08
9 156600 46927 1960-01-09
10 156600 5516751 1998-08-31
11 156600 5516751 1998-09-01
12 156600 5516751 1998-09-02
13 156600 5516751 1998-09-03
14 156600 5516751 1998-09-04
15 156600 H:CXP 1960-01-01
16 156600 H:CXP 1960-01-02
17 156600 H:CXP 1960-01-03
18 156600 H:CXP 1960-01-04
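For readers who want to run the groupby approach outside an IPython session, here is a self-contained sketch that rebuilds the small test frame shown above. The helper name expand is hypothetical, and selecting the two date columns before apply is an assumption made here to keep the grouping columns out of the applied frame on recent pandas versions. Note that taking min() of the starts and max() of the ends merges several intervals of the same (code, ent_value) pair into one continuous range.

```python
import pandas as pd

# Small test frame mirroring the one shown above.
df = pd.DataFrame({
    'code': [156600, 156600, 156600, 156600],
    'start_dt': pd.to_datetime(['1960-01-01', '1960-01-04', '1998-08-31', '1965-01-01']),
    'end_dt': pd.to_datetime(['1960-01-04', '1960-01-09', '1998-09-04', '1965-01-04']),
    'ent_value': ['H:CXP', '46927', '5516751', '4501242'],
})

def expand(group):
    # Hypothetical helper: one row per day between the group's
    # earliest start date and latest end date.
    return pd.DataFrame(
        {'as_of_dt': pd.date_range(group['start_dt'].min(), group['end_dt'].max())}
    )

out = (df.groupby(['code', 'ent_value'])[['start_dt', 'end_dt']]
         .apply(expand)
         .reset_index()
         .drop(columns='level_2'))

print(len(out))            # 19
print(list(out.columns))   # ['code', 'ent_value', 'as_of_dt']
```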
Suppose you have the following DataFrame, called df (see below to recreate this example):
id starttime endtime flag
0 A 2020-03-18 2020-03-20 y
1 B 2020-03-20 2020-03-23 n
2 C 2020-03-19 2020-03-21 y
You can then create a new DataFrame by iterating over all rows and expanding each one with date_range:
new_df = pd.DataFrame(
    data=((row.id, row.flag, date)
          # iterate over rows
          for row in df.itertuples()
          # expand each range into 1-day steps
          for date in pd.date_range(row.starttime, row.endtime, freq='1D')),
    columns=['name', 'flag', 'interval'])
You will end up with something like this:
name flag interval
0 A y 2020-03-18
1 A y 2020-03-19
2 A y 2020-03-20
3 B n 2020-03-20
4 B n 2020-03-21
5 B n 2020-03-22
6 B n 2020-03-23
7 C y 2020-03-19
8 C y 2020-03-20
9 C y 2020-03-21
To recreate the example DataFrame:
import pandas as pd
df = pd.DataFrame({
'id': ['A', 'B', 'C'],
'starttime': ['2020-03-18', '2020-03-20','2020-03-19' ],
'endtime': ['2020-03-20', '2020-03-23','2020-03-21'],
'flag': ['y','n','y']
})
df['starttime'] = pd.to_datetime(df['starttime'])
df['endtime'] = pd.to_datetime(df['endtime'])
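As a possible alternative to the nested iteration, the same expansion can be sketched with DataFrame.explode, which is available in pandas 0.25 and later. This is not taken from the answers above; it is a swapped-in technique, and whether it is actually faster depends on the data and has not been measured here. The column name interval follows the example above; the variable ranges is introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'C'],
    'starttime': pd.to_datetime(['2020-03-18', '2020-03-20', '2020-03-19']),
    'endtime': pd.to_datetime(['2020-03-20', '2020-03-23', '2020-03-21']),
    'flag': ['y', 'n', 'y'],
})

# Build one list of daily timestamps per row (illustrative variable name).
ranges = [list(pd.date_range(start, end, freq='D'))
          for start, end in zip(df['starttime'], df['endtime'])]

# explode() emits one output row per element of each list.
new_df = (df.assign(interval=ranges)
            .explode('interval')
            .drop(columns=['starttime', 'endtime'])
            .reset_index(drop=True))

# explode() leaves an object column; convert it back to datetime64.
new_df['interval'] = pd.to_datetime(new_df['interval'])

print(new_df.shape)  # (10, 3)
```

The lists are still built row by row in Python, so the gain, if any, comes from letting pandas assemble the output frame instead of a Python-level generator of tuples.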