First and last occurrence of log entries
I am aware that there are already many similar questions on Stack Overflow, but I just could not find anything, nor could I come up with my own solution... Here we go:
I run consistency checks on a changing set of data X, and with every run I might find some violations. I can identify the cases of the violations with a unique key. Once a violation is resolved in the original data set X, the violation obviously disappears from the checks. A violation can reappear at a later date, and should then be considered new.
Every time I run the checks, I create a log file, which records the date and key of the violation.
From this log file, I would like to extract how many cases / violations were in status open, and how many cases had been cumulatively closed, at any date in the log file:
This is another transformation of the left table which might help understand the result (numbers in (.) refer to the corresponding line in the left table):
AAA is opened on 5/1/2020 (1) and closed on 5/4/2020, because there is no entry for AAA on 5/4/2020, but we know tests were run on that date (6).
AAA is opened again after it was closed, on 5/5/2020, and closed on 5/7/2020, because tests were run on that date and AAA did not show up anymore (8). 5/6/2020 never shows up in the log, so no tests were run on that day.
BBB was opened on 5/2/2020 and still appeared on 5/7/2020, so it was never closed.
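Spelled out in plain Python (a minimal sketch; `log`, `test_dates`, and `events` are just illustrative names, not part of the actual solution), the rules above amount to:

```python
# The example log as appearance dates per key; a key "closes" on the first
# test date on which it no longer appears (5/6 is absent: no tests ran).
log = {
    "AAA": ["5/1", "5/2", "5/3", "5/5"],
    "BBB": ["5/2", "5/3", "5/4", "5/5", "5/7"],
}
test_dates = ["5/1", "5/2", "5/3", "5/4", "5/5", "5/7"]

def events(dates, test_dates):
    """Return (status, date) pairs: 'Start' when the key appears after an
    absence, 'End' on the first test date on which it is missing again."""
    present = set(dates)
    out, is_open = [], False
    for d in test_dates:
        if d in present and not is_open:
            out.append(("Start", d))
            is_open = True
        elif d not in present and is_open:
            out.append(("End", d))
            is_open = False
    return out
```

Here `events(log["AAA"], test_dates)` yields Start 5/1, End 5/4, Start 5/5, End 5/7, while `events(log["BBB"], test_dates)` yields only Start 5/2, matching the description above.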
Here is the skeleton code:
import pandas as pd

df = pd.DataFrame(
    {
        "date": [
            "5/1/2020",
            "5/2/2020",
            "5/3/2020",
            "5/5/2020",
            "5/2/2020",
            "5/3/2020",
            "5/4/2020",
            "5/5/2020",
            "5/7/2020",
        ],
        "key": ["AAA"] * 4 + ["BBB"] * 5,
    }
)
# astype("datetime64") is ambiguous (and rejected by newer pandas);
# parse the strings explicitly instead
df["date"] = pd.to_datetime(df["date"])
I believe I have to work with the date ladder (date_ladder = df[['date']].drop_duplicates().sort_values(by='date')) and do an outer merge with df, to get values for each key on all dates, and then continue from there. But I already fail at creating that merge.
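For reference, that merge can be sketched as follows (a sketch, assuming pandas >= 1.2, which added how='cross'; full and present are just illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": ["5/1/2020", "5/2/2020", "5/3/2020", "5/5/2020",
                 "5/2/2020", "5/3/2020", "5/4/2020", "5/5/2020", "5/7/2020"],
        "key": ["AAA"] * 4 + ["BBB"] * 5,
    }
)
df["date"] = pd.to_datetime(df["date"])

date_ladder = df[["date"]].drop_duplicates().sort_values(by="date")
keys = df[["key"]].drop_duplicates()

# Cross join: every test date paired with every key, then a left merge
# against the log; the merge indicator tells us which pairs were logged.
full = date_ladder.merge(keys, how="cross")
full = full.merge(df, on=["date", "key"], how="left", indicator="present")
full["present"] = full["present"] == "both"
```

The boolean present column then marks, for every (date, key) pair, whether the violation appeared in the log on that date.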
So, here is an attempt after sleeping on it... First, we will add a dummy variable, then pivot the dataframe and melt it back again:
df['v'] = True
# aggfunc='max' keeps the values boolean (the default 'mean' would turn them into floats)
pvt = df.pivot_table(index='date', columns='key', values='v', aggfunc='max').fillna(False)
df2 = pvt.reset_index().melt(id_vars='date')
"""
df2:
date key value
0 2020-05-01 AAA True
1 2020-05-02 AAA True
2 2020-05-03 AAA True
3 2020-05-04 AAA False
4 2020-05-05 AAA True
5 2020-05-07 AAA False
6 2020-05-01 BBB False
7 2020-05-02 BBB True
8 2020-05-03 BBB True
9 2020-05-04 BBB True
10 2020-05-05 BBB True
11 2020-05-07 BBB True
"""
Now we will shift the dataframe by 1 and check whether there is a switch from True to False (or vice versa), indicating that a record in the original log file appeared or disappeared:
# make sure the presence flag is a proper boolean column
df2['value'] = df2['value'].astype(bool)
prev = df2.shift()
# A switch happens when the presence flag flips within the same key, or
# when a new key's block starts with the violation already present.
# Comparing with != (instead of ^) also copes with the NaN in the first
# shifted row.
df2['switch'] = ((df2['value'] != prev['value']) & (df2['key'] == prev['key'])) | (
    (df2['key'] != prev['key']) & df2['value']
)
df2['start'] = df2['switch'] & df2['value']    # violation (re)appeared
df2['end'] = df2['switch'] & ~df2['value']     # violation disappeared
df2['status'] = ''
df2.loc[df2['start'], 'status'] = 'Start'
df2.loc[df2['end'], 'status'] = 'End'
"""
date key value switch start end status
0 2020-05-01 AAA True True True False Start
1 2020-05-02 AAA True False False False
2 2020-05-03 AAA True False False False
3 2020-05-04 AAA False True False True End
4 2020-05-05 AAA True True True False Start
5 2020-05-07 AAA False True False True End
6 2020-05-01 BBB False False False False
7 2020-05-02 BBB True True True False Start
8 2020-05-03 BBB True False False False
9 2020-05-04 BBB True False False False
10 2020-05-05 BBB True False False False
11 2020-05-07 BBB True False False False
"""
We can condense this with pd.melt():
# melt refuses value_name='date' because 'date' is already a column,
# so melt with the default names and rename afterwards
df3 = (df2.loc[df2['status'] != '']
          .melt(id_vars=['key', 'status'], value_vars='date')
          .drop('variable', axis=1)
          .rename(columns={'value': 'date'}))
"""
key status date
0 AAA Start 2020-05-01
1 AAA End 2020-05-04
2 AAA Start 2020-05-05
3 AAA End 2020-05-07
4 BBB Start 2020-05-02
"""
The next step is to sort the data by date, because we want to know how many items are open or closed at any point in time.
df4 = df2.sort_values(by='date')
df4['opened'] = df4['start'].cumsum()   # cumulative number of openings
df4['closed'] = df4['end'].cumsum()     # cumulative number of closings
"""
date key value switch ... end status opened closed
0 2020-05-01 AAA True True ... False Start 1 0
6 2020-05-01 BBB False False ... False 1 0
1 2020-05-02 AAA True False ... False 1 0
7 2020-05-02 BBB True True ... False Start 2 0
2 2020-05-03 AAA True False ... False 2 0
8 2020-05-03 BBB True False ... False 2 0
3 2020-05-04 AAA False True ... True End 2 1
9 2020-05-04 BBB True False ... False 2 1
4 2020-05-05 AAA True True ... False Start 3 1
10 2020-05-05 BBB True False ... False 3 1
5 2020-05-07 AAA False True ... True End 3 2
11 2020-05-07 BBB True False ... False 3 2
"""
Finally, we will run a groupby(), calculate how many items are open on each date, and plot the result:
result = df4.groupby('date').agg(
    opened=pd.NamedAgg(column='opened', aggfunc='max'),
    closed=pd.NamedAgg(column='closed', aggfunc='max'),
)
result['open_on'] = result['opened'] - result['closed']
result[['closed', 'open_on']].plot.area()