
Creating records in Python based on conditions from a dataframe

I have a dataframe A like this:

    Timestamp   A   B   C   D
1/1/2018 0:00   10  10  10  10
1/1/2018 0:10           10  25
1/1/2018 0:20           10  25
1/1/2018 0:30               25
1/1/2018 0:40               25
1/1/2018 0:50   25  25  25  25
1/1/2018 1:00   30  30  30  30
1/1/2018 1:10   42  42  42  42
1/1/2018 1:20               
1/1/2018 1:30               
1/1/2018 1:40   40      40  40
1/1/2018 1:50           35  35
1/1/2018 2:00               37
1/1/2018 2:10               49
1/1/2018 2:20   51  51      51

I want to delete the rows whose timestamps fall within the intervals listed in the following dataframe:

  StartTime       EndTime       Comment
1/1/2018 1:20  1/1/2018 1:30   to be removed
1/1/2018 2:00  1/1/2018 2:20   to be removed

That should leave dataframe A without those timestamps:

      Timestamp     A   B   C   D
    1/1/2018 0:00   10  10  10  10
    1/1/2018 0:10           10  25
    1/1/2018 0:20           10  25
    1/1/2018 0:30               25
    1/1/2018 0:40               25
    1/1/2018 0:50   25  25  25  25
    1/1/2018 1:00   30  30  30  30
    1/1/2018 1:10   42  42  42  42
    1/1/2018 1:40   40      40  40
    1/1/2018 1:50           35  35

From that, I want one record for each run of consecutive missing values in each column, like this:

StartTime       EndTime       Column    Comment
1/1/2018 0:10   1/1/2018 0:40   A        NULL
1/1/2018 1:50   1/1/2018 1:50   A        NULL
1/1/2018 0:10   1/1/2018 0:40   B        NULL
1/1/2018 1:40   1/1/2018 1:50   B        NULL
1/1/2018 0:30   1/1/2018 0:40   C        NULL

Thanks in advance for the help.

This is harder than I thought, but the following will do the job:

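import numpy as np
import pandas as pd

# Expand each removal interval into its 10-minute timestamps, then drop the matching rows
v=np.concatenate([pd.date_range(x['StartTime'],x['EndTime'],freq='10min') for _,x in remove.iterrows()])
df=df[~df.Timestamp.isin(v)]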
df
Out[36]:
             Timestamp   A   B   C   D
0  1900-01-01 00:00:00  10  10  10  10
1  1900-01-01 00:10:00          10  25
2  1900-01-01 00:20:00          10  25
3  1900-01-01 00:30:00              25
4  1900-01-01 00:40:00              25
5  1900-01-01 00:50:00  25  25  25  25
6  1900-01-01 01:00:00  30  30  30  30
7  1900-01-01 01:10:00  42  42  42  42
10 1900-01-01 01:40:00  40      40  40
11 1900-01-01 01:50:00          35  35
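
The `date_range` expansion above only matches timestamps that sit exactly on the 10-minute grid. If that is not guaranteed, one possible alternative is to compare every timestamp against every interval directly. This is only a sketch: the sample rows and the names `in_any_interval` and `kept` are made up for illustration, and the intervals are treated as closed on both ends.

```python
import pandas as pd

# Hypothetical sample: a few rows in the spirit of the question,
# including one off-grid timestamp (02:15).
df = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-01-01 01:10", "2018-01-01 01:20", "2018-01-01 01:30",
        "2018-01-01 01:40", "2018-01-01 02:00", "2018-01-01 02:15",
    ]),
    "D": [42, None, None, 40, 37, 49],
})
remove = pd.DataFrame({
    "StartTime": pd.to_datetime(["2018-01-01 01:20", "2018-01-01 02:00"]),
    "EndTime": pd.to_datetime(["2018-01-01 01:30", "2018-01-01 02:20"]),
})

# Broadcast timestamps (rows) against intervals (columns).
ts = df["Timestamp"].to_numpy()[:, None]         # shape (n_rows, 1)
start = remove["StartTime"].to_numpy()[None, :]  # shape (1, n_intervals)
end = remove["EndTime"].to_numpy()[None, :]

# True where a timestamp lies inside at least one closed interval.
in_any_interval = ((ts >= start) & (ts <= end)).any(axis=1)
kept = df[~in_any_interval]
print(kept)
```

This builds an n_rows by n_intervals boolean matrix, so it is best suited to a modest number of intervals.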


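# Reshape to long form: one (column, timestamp) entry per cell, ordered by column and then by time
v2=df.set_index('Timestamp').stack().swaplevel(1,0).sort_index(level=0)

# Keep only the missing cells (stored as empty strings in this frame)
v3=v2[v2==''].to_frame().reset_index(level=1)
# Start a new group whenever the step from the previous missing timestamp is not exactly 10 minutes
v3['New']=v3.Timestamp.diff().ne(pd.Timedelta(minutes=10)).cumsum()

# First and last timestamp of every run, per column
v3.groupby([v3.index.get_level_values(level=0),v3.New]).Timestamp.agg(['first','last']).reset_index()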
Out[74]:
  level_0  New               first                last
0       A    1 1900-01-01 00:10:00 1900-01-01 00:40:00
1       A    2 1900-01-01 01:50:00 1900-01-01 01:50:00
2       B    3 1900-01-01 00:10:00 1900-01-01 00:40:00
3       B    4 1900-01-01 01:40:00 1900-01-01 01:50:00
4       C    5 1900-01-01 00:30:00 1900-01-01 00:40:00
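
If the missing cells are stored as NaN rather than empty strings, the same idea can be written with `melt`, and the result can be shaped into the StartTime / EndTime / Column / Comment layout asked for in the question. The following is a minimal sketch under those assumptions; the shortened sample frame and the helper names (`step`, `long_form`, `missing`, `gaps`) are illustrative and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample: a shortened table with missing values stored as NaN.
df = pd.DataFrame({
    "Timestamp": pd.date_range("2018-01-01 00:00", periods=6, freq="10min"),
    "A": [10, None, None, 25, None, 30],
    "B": [10, None, 20, 25, 28, 30],
    "C": [10, 10, 10, 25, 28, 30],
})
step = pd.Timedelta(minutes=10)  # assumed sampling interval

# Long form: one row per (Timestamp, Column) cell.
long_form = df.melt(id_vars="Timestamp", var_name="Column", value_name="Value")
missing = long_form[long_form["Value"].isna()].sort_values(["Column", "Timestamp"])

# A new run starts when the column changes or the step from the
# previous missing timestamp is not exactly one sampling interval.
new_run = (
    missing["Column"].ne(missing["Column"].shift())
    | missing["Timestamp"].diff().ne(step)
)
missing = missing.assign(Run=new_run.cumsum())

gaps = (
    missing.groupby(["Column", "Run"])["Timestamp"]
    .agg(StartTime="first", EndTime="last")
    .reset_index()
    .drop(columns="Run")
)
gaps["Comment"] = None
gaps = gaps[["StartTime", "EndTime", "Column", "Comment"]]
print(gaps)
```

As with the answer's version, a run that spans deleted rows is split in two, because the step across the deleted rows is longer than 10 minutes.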

Let's use IntervalIndex and boolean indexing.

Create an IntervalIndex for df_b:

df_b.index = pd.IntervalIndex.from_arrays(df_b['StartTime'],df_b['EndTime'], closed='both')

# In current pandas, contains() tests the timestamp against every interval, so reduce with any()
df_a[~df_a['Timestamp'].apply(lambda ts: df_b.index.contains(ts).any())]

Output:

             Timestamp     A     B     C     D
0  2018-01-01 00:00:00  10.0  10.0  10.0  10.0
1  2018-01-01 00:10:00   NaN   NaN  10.0  25.0
2  2018-01-01 00:20:00   NaN   NaN  10.0  25.0
3  2018-01-01 00:30:00   NaN   NaN   NaN  25.0
4  2018-01-01 00:40:00   NaN   NaN   NaN  25.0
5  2018-01-01 00:50:00  25.0  25.0  25.0  25.0
6  2018-01-01 01:00:00  30.0  30.0  30.0  30.0
7  2018-01-01 01:10:00  42.0  42.0  42.0  42.0
10 2018-01-01 01:40:00  40.0   NaN  40.0  40.0
11 2018-01-01 01:50:00   NaN   NaN  35.0  35.0
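
The row-by-row `apply` can probably be avoided with `IntervalIndex.get_indexer`, which looks up all timestamps in one call and returns -1 where no interval matches. The sketch below assumes the intervals do not overlap (otherwise `get_indexer` raises an error) and that both frames hold parsed datetimes. The sample data and the names `intervals`, `position` and `kept` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with parsed datetimes.
df_a = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2018-01-01 01:00", "2018-01-01 01:10", "2018-01-01 01:20",
        "2018-01-01 01:30", "2018-01-01 01:40", "2018-01-01 01:50",
        "2018-01-01 02:00", "2018-01-01 02:10", "2018-01-01 02:20",
    ]),
})
df_b = pd.DataFrame({
    "StartTime": pd.to_datetime(["2018-01-01 01:20", "2018-01-01 02:00"]),
    "EndTime": pd.to_datetime(["2018-01-01 01:30", "2018-01-01 02:20"]),
})

# Closed on both ends, as in the answer above.
intervals = pd.IntervalIndex.from_arrays(
    df_b["StartTime"], df_b["EndTime"], closed="both"
)

# Position of the matching interval for each timestamp, or -1 if none matches.
position = intervals.get_indexer(df_a["Timestamp"])
kept = df_a[position == -1]
print(kept)
```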
