How to add a column to a pandas DataFrame with values based on matching values in two DataFrames
I am working with two pandas DataFrames. One contains the performance data of different servers for every hour and looks something like this:
Date | time | server_name | CPU | Memory |
---|---|---|---|---|
2020-10-25 | 300 | server1 | 90.2 | 64.4 |
2020-10-25 | 300 | server2 | 50.4 | 23.3 |
In this case, '300' in the column 'time' means 3am.
The second DataFrame contains error data for the different servers and looks something like this:
server_name | timestamp |
---|---|
server1 | 2020-10-25 00:45:04 |
server2 | 2020-10-25 03:45:04 |
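For reproducibility, the two frames above could be built roughly as follows. This is only a sketch: the dtypes are assumptions (`Date` and `timestamp` parsed as datetimes, `time` stored as an integer), and `hour_start` is a name introduced here purely to illustrate how '300' maps to 03:00.

```python
import pandas as pd

# Sample data copied from the two tables above.
# Assumption: 'Date' is datetime64 and 'time' is an integer such as 300 for 03:00.
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-25", "2020-10-25"]),
    "time": [300, 300],
    "server_name": ["server1", "server2"],
    "CPU": [90.2, 50.4],
    "Memory": [64.4, 23.3],
})

df2 = pd.DataFrame({
    "server_name": ["server1", "server2"],
    "timestamp": pd.to_datetime(["2020-10-25 00:45:04", "2020-10-25 03:45:04"]),
})

# '300' means 03:00, so the hour is the integer part of time / 100.
hour_start = df1["Date"] + pd.to_timedelta(df1["time"] // 100, unit="h")
print(hour_start)
```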
I would like to add a column to the first DataFrame (the one with the performance metrics) that indicates, for every server and every hour, whether an error occurred during that hour. Please note that an error which occurred at 3:45am should be assigned to the row for 3am for the respective server. It should look something like this:
Date | time | server_name | CPU | Memory | error |
---|---|---|---|---|---|
2020-10-25 | 300 | server1 | 90.2 | 64.4 | 0 |
2020-10-25 | 300 | server2 | 50.4 | 23.3 | 1 |
In this case, '1' in the column 'error' would mean that at this time, an error occurred on the server.
I already tried merging the DataFrames on date, time and server_name and many other approaches, but I just don't get the desired results.
Assuming `df1` is your first DataFrame and `df2` is the second one, you could add a timestamp column to `df1` by adding the `Date` and `time` columns (with `time` converted to a timedelta in hours), and then use `merge_asof` to bind each row of the second frame to a row from that modified DataFrame.
From there, you could merge that new DataFrame into the first one, and a `groupby` and `count` should give the expected result.
Possible code:
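import pandas as pd

# Assumes df1['Date'] and df2['timestamp'] are datetime64 and that both
# frames are sorted by time, as merge_asof requires.
df3 = pd.merge_asof(df2, df1.assign(timestamp=df1['Date']
                                    + pd.to_timedelta(df1['time']/100, 'h')),
                    by='server_name', on='timestamp',
                    tolerance=pd.Timedelta('1h'))
print(df3)
# Left-merge the matched errors back, then count them per performance row.
result = df1.merge(df3[['server_name', 'timestamp', 'Date', 'time']], 'left',
                   on=['server_name', 'Date', 'time']
                   ).groupby(['server_name', 'Date', 'time', 'CPU', 'Memory']
                   ).count().rename(columns={'timestamp': 'error'}
                   ).reset_index()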
With your data, it gives the expected result:
server_name Date time CPU Memory error
0 server1 2020-10-25 300 90.2 64.4 0
1 server2 2020-10-25 300 50.4 23.3 1
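If a 0/1 flag is enough and the error count per hour is not needed, a different technique might also work: truncate each error timestamp to the start of its hour with `dt.floor` and do a plain left merge on server and hour. This is a sketch of that alternative, not the `merge_asof` approach above; the sample frames are rebuilt here under the assumption that `Date` and `timestamp` are datetimes, and the `hour` helper column and the `errors` frame are names made up for illustration.

```python
import pandas as pd

# Sample frames from the question (assumed dtypes: datetime64 for Date/timestamp).
df1 = pd.DataFrame({
    "Date": pd.to_datetime(["2020-10-25", "2020-10-25"]),
    "time": [300, 300],
    "server_name": ["server1", "server2"],
    "CPU": [90.2, 50.4],
    "Memory": [64.4, 23.3],
})
df2 = pd.DataFrame({
    "server_name": ["server1", "server2"],
    "timestamp": pd.to_datetime(["2020-10-25 00:45:04", "2020-10-25 03:45:04"]),
})

# One row per (server, hour) in which at least one error occurred,
# e.g. 03:45:04 is truncated to 03:00:00.
errors = (
    df2.assign(hour=df2["timestamp"].dt.floor("h"))[["server_name", "hour"]]
    .drop_duplicates()
    .assign(error=1)
)

# Give df1 the same hourly key, left-merge the flags, then drop the helper key.
result = (
    df1.assign(hour=df1["Date"] + pd.to_timedelta(df1["time"] // 100, unit="h"))
    .merge(errors, how="left", on=["server_name", "hour"])
    .drop(columns="hour")
)

# Rows without a matching error come back as NaN; turn them into 0.
result["error"] = result["error"].fillna(0).astype(int)
print(result)
```

Because the errors are de-duplicated per server and hour before merging, several errors in the same hour still produce a single row with `error` equal to 1.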