Creating records in Python based on conditions from a dataframe
I have a dataframe A like this.
Timestamp A B C D
1/1/2018 0:00 10 10 10 10
1/1/2018 0:10 10 25
1/1/2018 0:20 10 25
1/1/2018 0:30 25
1/1/2018 0:40 25
1/1/2018 0:50 25 25 25 25
1/1/2018 1:00 30 30 30 30
1/1/2018 1:10 42 42 42 42
1/1/2018 1:20
1/1/2018 1:30
1/1/2018 1:40 40 40 40
1/1/2018 1:50 35 35
1/1/2018 2:00 37
1/1/2018 2:10 49
1/1/2018 2:20 51 51 51
I want to delete the rows whose timestamps fall inside the time ranges listed in the following dataframe.
StartTime EndTime Comment
1/1/2018 1:20 1/1/2018 1:30 to be removed
1/1/2018 2:00 1/1/2018 2:20 to be removed
That should give dataframe A without the above timestamps:
Timestamp A B C D
1/1/2018 0:00 10 10 10 10
1/1/2018 0:10 10 25
1/1/2018 0:20 10 25
1/1/2018 0:30 25
1/1/2018 0:40 25
1/1/2018 0:50 25 25 25 25
1/1/2018 1:00 30 30 30 30
1/1/2018 1:10 42 42 42 42
1/1/2018 1:40 40 40 40
1/1/2018 1:50 35 35
From that, I want one record per run of consecutive missing values in each column, like this:
StartTime EndTime Column Comment
1/1/2018 0:10 1/1/2018 0:40 A NULL
1/1/2018 1:50 1/1/2018 1:50 A NULL
1/1/2018 0:10 1/1/2018 0:40 B NULL
1/1/2018 1:40 1/1/2018 1:50 B NULL
1/1/2018 0:30 1/1/2018 0:40 C NULL
Thanks in advance for the help.
This is harder than I thought, but the following will do the job:
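import numpy as np
import pandas as pd
# df is dataframe A; remove holds the StartTime/EndTime ranges to drop
v=np.concatenate([pd.date_range(x['StartTime'],x['EndTime'],freq='10min') for _,x in remove.iterrows()])
df=df[~df.Timestamp.isin(v)]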
df
Out[36]:
Timestamp A B C D
0 1900-01-01 00:00:00 10 10 10 10
1 1900-01-01 00:10:00 10 25
2 1900-01-01 00:20:00 10 25
3 1900-01-01 00:30:00 25
4 1900-01-01 00:40:00 25
5 1900-01-01 00:50:00 25 25 25 25
6 1900-01-01 01:00:00 30 30 30 30
7 1900-01-01 01:10:00 42 42 42 42
10 1900-01-01 01:40:00 40 40 40
11 1900-01-01 01:50:00 35 35
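For anyone who wants to try the row-removal step in isolation, here is a minimal self-contained sketch. The small df and remove frames below are made-up stand-ins for the question's data, and it assumes the timestamps sit on a regular 10-minute grid, as in the post.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data (one value column only).
df = pd.DataFrame({
    "Timestamp": pd.date_range("2018-01-01 01:00", periods=9, freq="10min"),
    "A": [30, 42, np.nan, np.nan, 40, np.nan, 37, 49, 51],
})
remove = pd.DataFrame({
    "StartTime": pd.to_datetime(["2018-01-01 01:20", "2018-01-01 02:00"]),
    "EndTime": pd.to_datetime(["2018-01-01 01:30", "2018-01-01 02:20"]),
})

# Expand every [StartTime, EndTime] range into its 10-minute timestamps.
to_drop = np.concatenate([
    pd.date_range(row.StartTime, row.EndTime, freq="10min").values
    for row in remove.itertuples(index=False)
])

# Keep only the rows whose timestamp is not in any of the ranges.
kept = df[~df["Timestamp"].isin(to_drop)]
print(kept)
```

This only matches timestamps that land exactly on the grid; a timestamp such as 1:25 would survive even though it lies inside a range.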
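# Reshape to long format indexed by (column, Timestamp) and keep only the empty cells
v2=df.set_index('Timestamp').stack().swaplevel(1,0).sort_index(level=0)
v3=v2[v2==''].to_frame().reset_index(level=1)
# Start a new group whenever consecutive empty timestamps are not exactly 10 minutes apart
v3['New']=v3.Timestamp.diff().ne(pd.Timedelta(minutes=10)).cumsum()
v3.groupby([v3.index.get_level_values(level=0),v3.New]).Timestamp.agg(['first','last']).reset_index()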
Out[74]:
level_0 New first last
0 A 1 1900-01-01 00:10:00 1900-01-01 00:40:00
1 A 2 1900-01-01 01:50:00 1900-01-01 01:50:00
2 B 3 1900-01-01 00:10:00 1900-01-01 00:40:00
3 B 4 1900-01-01 01:40:00 1900-01-01 01:50:00
4 C 5 1900-01-01 00:30:00 1900-01-01 00:40:00
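The grouping idea above can also be sketched in a self-contained form. Unlike the snippet above, which compares cells to empty strings, this sketch assumes the missing cells are NaN, and the sample frame is invented for illustration. The column names StartTime, EndTime and Column follow the result layout asked for in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: NaN marks a missing reading.
df = pd.DataFrame({
    "Timestamp": pd.date_range("2018-01-01 00:00", periods=6, freq="10min"),
    "A": [10, np.nan, np.nan, 25, np.nan, 30],
    "B": [10, 10, np.nan, np.nan, 25, 30],
})
step = pd.Timedelta(minutes=10)

# Long format: one row per (Timestamp, Column); keep only the missing cells.
long = df.melt(id_vars="Timestamp", var_name="Column", value_name="value")
missing = long[long["value"].isna()].sort_values(["Column", "Timestamp"])

# Within each column, a new run starts whenever the gap to the previous
# missing timestamp is not exactly one step (the first row has gap NaT).
gap = missing.groupby("Column")["Timestamp"].diff()
missing = missing.assign(run=gap.ne(step).cumsum())

# Collapse each run to its first and last timestamp.
runs = (
    missing.groupby(["Column", "run"])["Timestamp"]
    .agg(StartTime="first", EndTime="last")
    .reset_index()
    .drop(columns="run")
)
print(runs)
```

Taking the diff per column means a run can never continue from the end of one column into the start of the next.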
Let's use IntervalIndex and boolean indexing:
Create an interval index for df_b, then keep the rows of df_a whose Timestamp falls in none of the intervals:
df_b.index = pd.IntervalIndex.from_arrays(df_b['StartTime'],df_b['EndTime'], closed='both')
# In recent pandas versions IntervalIndex.contains works elementwise, so reduce it with .any()
df_a[~df_a['Timestamp'].apply(lambda t: df_b.index.contains(t).any())]
Output:
Timestamp A B C D
0 2018-01-01 00:00:00 10.0 10.0 10.0 10.0
1 2018-01-01 00:10:00 NaN NaN 10.0 25.0
2 2018-01-01 00:20:00 NaN NaN 10.0 25.0
3 2018-01-01 00:30:00 NaN NaN NaN 25.0
4 2018-01-01 00:40:00 NaN NaN NaN 25.0
5 2018-01-01 00:50:00 25.0 25.0 25.0 25.0
6 2018-01-01 01:00:00 30.0 30.0 30.0 30.0
7 2018-01-01 01:10:00 42.0 42.0 42.0 42.0
10 2018-01-01 01:40:00 40.0 NaN 40.0 40.0
11 2018-01-01 01:50:00 NaN NaN 35.0 35.0
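Here is a minimal runnable sketch of the same interval idea, using IntervalIndex.get_indexer instead of a row-wise apply. The sample frames are invented, and the approach assumes the intervals in df_b do not overlap, since get_indexer raises an error for overlapping intervals.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
df_a = pd.DataFrame({
    "Timestamp": pd.date_range("2018-01-01 01:00", periods=9, freq="10min"),
    "A": range(9),
})
df_b = pd.DataFrame({
    "StartTime": pd.to_datetime(["2018-01-01 01:20", "2018-01-01 02:00"]),
    "EndTime": pd.to_datetime(["2018-01-01 01:30", "2018-01-01 02:20"]),
})

# closed="both" makes StartTime and EndTime themselves part of each interval.
intervals = pd.IntervalIndex.from_arrays(
    df_b["StartTime"], df_b["EndTime"], closed="both"
)

# get_indexer returns -1 for timestamps that fall inside no interval.
in_window = intervals.get_indexer(df_a["Timestamp"]) != -1
result = df_a[~in_window]
print(result)
```

Because the test is on interval membership rather than on a fixed grid, timestamps that are not multiples of 10 minutes are handled as well.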