First and last occurrence of log entries
I am aware that there are already many similar questions on Stack Overflow, but I just could not find anything, nor could I come up with my own solution... Here we go:
I run consistency checks on a changing set of data X, and with every run I might find some violations. I can identify the cases of the violations with a unique key. Once a violation is resolved in the original data set X, the violation obviously disappears from the checks. A violation can reappear at a later date, and should then be considered new.
Every time I run the checks, I create a log file, which records the date and key of the violation.
From this log file, I would like to extract how many cases / violations were in status open, and how many cases had been cumulatively closed, at any date in the log file:
This is another transformation of the left table which might help understand the result (numbers in (.) refer to the corresponding line in the left table):
AAA is opened on 5/1/2020 (1) and closed on 5/4/2020, because there is no entry for AAA on 5/4/2020, but we know tests were run on that date (6).
AAA is opened again after it was closed, on 5/5/2020, and closed on 5/7/2020, because tests were run on that date and AAA did not show up anymore (8). 5/6/2020 never shows up in the log, so no tests were run on that day.
BBB was opened on 5/2/2020 and still appeared on 5/7/2020, so it was never closed.
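Spelled out in plain Python (a minimal sketch; `log`, `test_dates`, and `events` are just illustrative names, not part of the actual solution), the rules above amount to:

```python
# The example log as appearance dates per key; a key "closes" on the first
# test date on which it no longer appears (5/6 is absent: no tests ran).
log = {
    "AAA": ["5/1", "5/2", "5/3", "5/5"],
    "BBB": ["5/2", "5/3", "5/4", "5/5", "5/7"],
}
test_dates = ["5/1", "5/2", "5/3", "5/4", "5/5", "5/7"]

def events(dates, test_dates):
    """Return (status, date) pairs: 'Start' when the key appears after an
    absence, 'End' on the first test date on which it is missing again."""
    present = set(dates)
    out, is_open = [], False
    for d in test_dates:
        if d in present and not is_open:
            out.append(("Start", d))
            is_open = True
        elif d not in present and is_open:
            out.append(("End", d))
            is_open = False
    return out
```

Here `events(log["AAA"], test_dates)` yields Start 5/1, End 5/4, Start 5/5, End 5/7, while `events(log["BBB"], test_dates)` yields only Start 5/2, matching the description above.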
Here is the skeleton code:
import pandas as pd

df = pd.DataFrame(
    {
        "date": [
            "5/1/2020",
            "5/2/2020",
            "5/3/2020",
            "5/5/2020",
            "5/2/2020",
            "5/3/2020",
            "5/4/2020",
            "5/5/2020",
            "5/7/2020",
        ],
        "key": ["AAA"] * 4 + ["BBB"] * 5,
    }
)
# astype("datetime64") is ambiguous (and rejected by newer pandas);
# parse the strings explicitly instead
df["date"] = pd.to_datetime(df["date"])
I believe I have to work with the date ladder (date_ladder = df[['date']].drop_duplicates().sort_values(by='date')) and do an outer merge with df, to get values for each key on all dates, and then continue from there. But I already fail at creating that merge.
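For reference, that merge can be sketched as follows (a sketch, assuming pandas >= 1.2, which added how='cross'; full and present are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["5/1/2020", "5/2/2020", "5/3/2020", "5/5/2020",
                 "5/2/2020", "5/3/2020", "5/4/2020", "5/5/2020", "5/7/2020"],
        "key": ["AAA"] * 4 + ["BBB"] * 5,
    }
)
df["date"] = pd.to_datetime(df["date"])

date_ladder = df[["date"]].drop_duplicates().sort_values(by="date")
keys = df[["key"]].drop_duplicates()

# Cross join: every test date paired with every key, then a left merge
# against the log; the merge indicator tells us which pairs were logged.
full = date_ladder.merge(keys, how="cross")
full = full.merge(df, on=["date", "key"], how="left", indicator="present")
full["present"] = full["present"] == "both"
```

The boolean present column then marks, for every (date, key) pair, whether the violation appeared in the log on that date.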
So, here is an attempt after sleeping on it... First, we will add a dummy variable, then pivot the dataframe and melt it back again:
df['v'] = True
# aggfunc='max' keeps the values boolean (the default 'mean' would turn them into floats)
pvt = df.pivot_table(index='date', columns='key', values='v', aggfunc='max').fillna(False)
df2 = pvt.reset_index().melt(id_vars='date')
"""
df2:
date key value
0 2020-05-01 AAA True
1 2020-05-02 AAA True
2 2020-05-03 AAA True
3 2020-05-04 AAA False
4 2020-05-05 AAA True
5 2020-05-07 AAA False
6 2020-05-01 BBB False
7 2020-05-02 BBB True
8 2020-05-03 BBB True
9 2020-05-04 BBB True
10 2020-05-05 BBB True
11 2020-05-07 BBB True
"""
Now we will shift the dataframe by 1 and check whether there is a switch from True to False (or vice versa), indicating that a record in the original log file appeared or disappeared:
# make sure the presence flag is a proper boolean column
df2['value'] = df2['value'].astype(bool)
prev = df2.shift()
# A switch happens when the presence flag flips within the same key, or
# when a new key's block starts with the violation already present.
# Comparing with != (instead of ^) also copes with the NaN in the first
# shifted row.
df2['switch'] = ((df2['value'] != prev['value']) & (df2['key'] == prev['key'])) | (
    (df2['key'] != prev['key']) & df2['value']
)
df2['start'] = df2['switch'] & df2['value']    # violation (re)appeared
df2['end'] = df2['switch'] & ~df2['value']     # violation disappeared
df2['status'] = ''
df2.loc[df2['start'], 'status'] = 'Start'
df2.loc[df2['end'], 'status'] = 'End'
"""
date key value switch start end status
0 2020-05-01 AAA True True True False Start
1 2020-05-02 AAA True False False False
2 2020-05-03 AAA True False False False
3 2020-05-04 AAA False True False True End
4 2020-05-05 AAA True True True False Start
5 2020-05-07 AAA False True False True End
6 2020-05-01 BBB False False False False
7 2020-05-02 BBB True True True False Start
8 2020-05-03 BBB True False False False
9 2020-05-04 BBB True False False False
10 2020-05-05 BBB True False False False
11 2020-05-07 BBB True False False False
"""
We can condense this with pd.melt():
# melt refuses value_name='date' because 'date' is already a column,
# so melt with the default names and rename afterwards
df3 = (df2.loc[df2['status'] != '']
          .melt(id_vars=['key', 'status'], value_vars='date')
          .drop('variable', axis=1)
          .rename(columns={'value': 'date'}))
"""
key status date
0 AAA Start 2020-05-01
1 AAA End 2020-05-04
2 AAA Start 2020-05-05
3 AAA End 2020-05-07
4 BBB Start 2020-05-02
"""
The next step is to sort the data by date, because we want to know how many items are open or closed at any point in time.
df4 = df2.sort_values(by='date')
df4['opened'] = df4['start'].cumsum()   # cumulative number of openings
df4['closed'] = df4['end'].cumsum()     # cumulative number of closings
"""
date key value switch ... end status opened closed
0 2020-05-01 AAA True True ... False Start 1 0
6 2020-05-01 BBB False False ... False 1 0
1 2020-05-02 AAA True False ... False 1 0
7 2020-05-02 BBB True True ... False Start 2 0
2 2020-05-03 AAA True False ... False 2 0
8 2020-05-03 BBB True False ... False 2 0
3 2020-05-04 AAA False True ... True End 2 1
9 2020-05-04 BBB True False ... False 2 1
4 2020-05-05 AAA True True ... False Start 3 1
10 2020-05-05 BBB True False ... False 3 1
5 2020-05-07 AAA False True ... True End 3 2
11 2020-05-07 BBB True False ... False 3 2
"""
Finally, we will run a groupby(), calculate how many items are open on each date, and plot the result:
result = df4.groupby('date').agg(
    opened=pd.NamedAgg(column='opened', aggfunc='max'),
    closed=pd.NamedAgg(column='closed', aggfunc='max'),
)
result['open_on'] = result['opened'] - result['closed']
result[['closed', 'open_on']].plot.area()