Compare each row in one dataframe to each row in another dataframe in Python
Query for one dataframe row based on row in another dataframe & compare values
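I have two dataframes. The first contains numeric data used for "scoring", and the second contains simulation data.
df1 = base records
df2 = simulation records
Part 1: For a row in df2 (simulation records), I want to query df1 (base records) for the row that matches exactly on the "Name" and "Time" columns and has the closest timestamp.
Part 2: Then I want to use an if/then function to determine whether the value in the simulation record row falls within a range built from two values in the base record row, and return a boolean.
low range = df1['Po'] - df1['Ref']
high range = df1['Po'] + df1['Ref']
If df2['Sim'] is between the low range and the high range of its nearest df1 base record, I want to return True in a new column "Sim Score", otherwise False.
Part 3: I want to repeat Parts 1 and 2 for every row in the simulation records.
Helpful information: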
df1 base records example (columns that matter)
Timestamp Name Time Po Ref
7/11/2022 11:30:00 trial 20 mins 5 2
7/10/2022 04:00:00 trial 20 mins 4 4
7/09/2022 02:45:00 trial 20 mins 2 2
6/28/2022 03:45:00 trial 20 mins 3 6
df2 simulation records example (columns that matter)
Timestamp Name Time Sim
7/10/2022 05:15:00 trial 20 mins 7
7/11/2022 12:45:00 trial 20 mins 4
7/12/2022 03:30:00 trial 20 mins 8
desired result of new column added to df2
Timestamp Name Time Sim Sim Score
7/10/2022 05:15:00 trial 20 mins 7 True
7/11/2022 12:45:00 trial 20 mins 4 True
7/12/2022 03:30:00 trial 20 mins 8 False
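Since you did not provide code to build the dataframes, I will only sketch a potential solution:
First, I assume you have only one timestamp per day (it looks that way in your example). So I would truncate or split the timestamp so that one column contains only the date. That lets us join the dataframes on the date, i.e. call set_index("date_column") on both dataframes and join them (use an inner join to keep only the rows whose date exists in both dataframes). Finally, you can use apply() to check your condition: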
df_joined['Sim Score'] = df_joined.apply(lambda row: (row['Po']-row['Ref'] <= row['Sim']) and (row['Po']+row['Ref'] >= row['Sim']), axis = 1)
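To make the sketch above concrete, here is a minimal runnable version using the example data from the question. The helper column name `date` is an assumption of mine, and the range check uses the vectorized `Series.between` instead of `apply()`; treat it as one possible reading of the idea, not a definitive implementation.

```python
import pandas as pd

# Example data copied from the question.
df1 = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial'] * 4,
    'Time': ['20 mins'] * 4,
    'Po': [5, 4, 2, 3],
    'Ref': [2, 4, 2, 6],
})
df2 = pd.DataFrame({
    'Timestamp': ['7/10/2022 05:15:00', '7/11/2022 12:45:00',
                  '7/12/2022 03:30:00'],
    'Name': ['trial'] * 3,
    'Time': ['20 mins'] * 3,
    'Sim': [7, 4, 8],
})

fmt = '%m/%d/%Y %H:%M:%S'
for df in (df1, df2):
    df['Timestamp'] = pd.to_datetime(df['Timestamp'], format=fmt)
    # 'date' is a hypothetical helper column: the timestamp truncated to midnight.
    df['date'] = df['Timestamp'].dt.normalize()

# Inner join on the date keeps only the dates present in both dataframes.
df_joined = df2.set_index('date').join(
    df1.set_index('date')[['Po', 'Ref']], how='inner')

# True when Po - Ref <= Sim <= Po + Ref.
df_joined['Sim Score'] = df_joined['Sim'].between(
    df_joined['Po'] - df_joined['Ref'], df_joined['Po'] + df_joined['Ref'])
print(df_joined[['Timestamp', 'Sim', 'Po', 'Ref', 'Sim Score']])
```

With this data the 7/12/2022 simulation row has no base record on the same date, so the inner join drops it instead of scoring it False. That is the main limitation of joining on the date compared with a nearest-timestamp match.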
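Use pandas.DataFrame.reindex, whose method parameter can find the nearest match on an index for which a distance can be computed (for example, no distance can be computed between strings),
or use merge_asof, whose direction parameter offers 'nearest'.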
reindex() with method='nearest'
import pandas as pd

df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
print(df1)
###
Name Time Po Ref l_r h_r
Timestamp
2022-07-11 11:30:00 trial 20 mins 5 2 3 7
2022-07-10 04:00:00 trial 20 mins 4 4 0 8
2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
df2.set_index('Timestamp', inplace=True)
print(df2)
###
Name Time Sim
Timestamp
2022-07-10 05:15:00 trial 20 mins 7
2022-07-11 12:45:00 trial 20 mins 4
2022-07-12 03:30:00 trial 20 mins 8
temp = df2.join(df1.reindex(df2.index, method='nearest'), lsuffix='_left', rsuffix='_right')
print(temp)
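As you can see, this is df2.join(df1), which joins DataFrame objects by index.
With method='nearest', reindex first aligns df1 to df2's index by picking, for each df2 Timestamp, the df1 row with the nearest Timestamp, so the join pairs every simulation row with its nearest base record. Note that reindex with a method only works when df1's index is monotonic (here it is sorted in descending order).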
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
df2.reset_index(inplace=True)
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
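To see which base record each simulation timestamp is paired with by a nearest-index lookup, the matching can be inspected directly with `Index.get_indexer`. This is only an illustrative sketch; the variable names `base_index` and `sim_index` are mine, and the timestamps are the ones from the question, sorted in ascending order.

```python
import pandas as pd

fmt = '%m/%d/%Y %H:%M:%S'

# Base-record timestamps (sorted ascending) and simulation timestamps.
base_index = pd.to_datetime(
    ['6/28/2022 03:45:00', '7/09/2022 02:45:00',
     '7/10/2022 04:00:00', '7/11/2022 11:30:00'], format=fmt)
sim_index = pd.to_datetime(
    ['7/10/2022 05:15:00', '7/11/2022 12:45:00',
     '7/12/2022 03:30:00'], format=fmt)

# For each simulation timestamp, the position of the nearest base timestamp.
positions = base_index.get_indexer(sim_index, method='nearest')

for sim_ts, pos in zip(sim_index, positions):
    print(sim_ts, '->', base_index[pos])
```

The last two simulation rows both land on the 2022-07-11 11:30:00 base record, because the lookup has no upper limit on how far away the nearest match may be.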
merge_asof() with direction='nearest'
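This approach does not work on index values, so we do not have to set the Timestamp column as the index. It does, however, require both objects to be sorted by the merge key (here, the Timestamp column).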
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
# df1.set_index('Timestamp', inplace=True)
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
3 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
2 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
1 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
df2['Timestamp'] = pd.to_datetime(df2['Timestamp'])
# df2.set_index('Timestamp', inplace=True)
df2.sort_values(by='Timestamp', inplace=True)
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
temp = pd.merge_asof(df2, df1[['Timestamp', 'l_r', 'h_r']], on='Timestamp', direction='nearest')
print(temp)
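As you can see, this is pd.merge_asof(df2, df1),
which is similar to a left join except that it matches on the nearest key rather than on equal keys. Both DataFrames must be sorted by the key.
For each row in the left DataFrame:
A "nearest" search selects the row in the right DataFrame whose 'on' key is closest in absolute distance to the left's key.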
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
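Because merge_asof keeps only the left key, it can be useful to carry a copy of the base timestamp along to check which record was matched and how far away it was. The sketch below assumes the example data from the question; the column names `base_ts` and `gap` are hypothetical, and the two-hour `tolerance` is an arbitrary value chosen for illustration.

```python
import pandas as pd

fmt = '%m/%d/%Y %H:%M:%S'
base = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Po': [5, 4, 2, 3],
    'Ref': [2, 4, 2, 6],
})
sim = pd.DataFrame({
    'Timestamp': ['7/10/2022 05:15:00', '7/11/2022 12:45:00',
                  '7/12/2022 03:30:00'],
    'Sim': [7, 4, 8],
})
base['Timestamp'] = pd.to_datetime(base['Timestamp'], format=fmt)
sim['Timestamp'] = pd.to_datetime(sim['Timestamp'], format=fmt)

# merge_asof needs both sides sorted by the key.
base = base.sort_values('Timestamp')
sim = sim.sort_values('Timestamp')

# Hypothetical helper column: a copy of the base timestamp that survives the merge.
base['base_ts'] = base['Timestamp']

matched = pd.merge_asof(sim, base, on='Timestamp', direction='nearest')
matched['gap'] = (matched['Timestamp'] - matched['base_ts']).abs()
print(matched[['Timestamp', 'base_ts', 'gap']])

# Optional: refuse matches further away than an (arbitrary) two hours.
limited = pd.merge_asof(sim, base, on='Timestamp', direction='nearest',
                        tolerance=pd.Timedelta(hours=2))
print(limited[['Timestamp', 'base_ts']])
```

With `tolerance` set, the 2022-07-12 row finds no base record within two hours and gets missing values instead of a match. Whether such a row should score False or be left undecided is a choice the question does not settle.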
Frankly, if you have a large dataset, working with the index will be faster.
I reworked df1, adding different Name and Time values:
df1 = pd.DataFrame({'Timestamp':['7/11/2022 11:30:00','7/11/2022 11:30:00','7/10/2022 04:00:00','7/10/2022 04:00:00','7/09/2022 02:45:00','6/28/2022 03:45:00'],
'Name':['trial','trial','trial','non-trial','trial','trial'],
'Time':['20 mins','30 mins','20 mins','20 mins','20 mins','20 mins'],
'Po':[5, 6, 4, 1, 2, 3],
'Ref':[2, 2, 4, 3, 2, 6]})
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['l_r'] = df1['Po'] - df1['Ref']
df1['h_r'] = df1['Po'] + df1['Ref']
df1.sort_values(by='Timestamp', inplace=True)
print(df1)
###
Timestamp Name Time Po Ref l_r h_r
5 2022-06-28 03:45:00 trial 20 mins 3 6 -3 9
4 2022-07-09 02:45:00 trial 20 mins 2 2 0 4
2 2022-07-10 04:00:00 trial 20 mins 4 4 0 8
3 2022-07-10 04:00:00 non-trial 20 mins 1 3 -2 4
0 2022-07-11 11:30:00 trial 20 mins 5 2 3 7
1 2022-07-11 11:30:00 trial 30 mins 6 2 4 8
print(df2)
###
Timestamp Name Time Sim
0 2022-07-10 05:15:00 trial 20 mins 7
1 2022-07-11 12:45:00 trial 20 mins 4
2 2022-07-12 03:30:00 trial 20 mins 8
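merge_asof can match nearest values on only a single key (on=), so the other columns are handled with by=, which requires them to match exactly.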
temp = pd.merge_asof(df2, df1[['Timestamp', 'Name', 'Time', 'l_r', 'h_r']], on='Timestamp', by=['Name','Time'], direction='nearest')
print(temp)
df2['Sim Score'] = temp['Sim'].between(temp['l_r'], temp['h_r']).values
print(df2)
###
Timestamp Name Time Sim Sim Score
0 2022-07-10 05:15:00 trial 20 mins 7 True
1 2022-07-11 12:45:00 trial 20 mins 4 True
2 2022-07-12 03:30:00 trial 20 mins 8 False
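The example df2 contains only 'trial' / '20 mins' rows, so the effect of by= is not visible in the output above. The following sketch uses a hypothetical simulation frame (the rows in `sim` are invented for illustration) against the reworked df1 to show that each row is matched only within its own Name and Time group.

```python
import pandas as pd

fmt = '%m/%d/%Y %H:%M:%S'

# The reworked base records from above.
base = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/11/2022 11:30:00',
                  '7/10/2022 04:00:00', '7/10/2022 04:00:00',
                  '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial', 'trial', 'trial', 'non-trial', 'trial', 'trial'],
    'Time': ['20 mins', '30 mins', '20 mins', '20 mins', '20 mins', '20 mins'],
    'Po': [5, 6, 4, 1, 2, 3],
    'Ref': [2, 2, 4, 3, 2, 6],
})
# Hypothetical simulation rows covering three different (Name, Time) groups.
sim = pd.DataFrame({
    'Timestamp': ['7/10/2022 05:15:00', '7/10/2022 05:15:00',
                  '7/11/2022 12:45:00'],
    'Name': ['trial', 'non-trial', 'trial'],
    'Time': ['20 mins', '20 mins', '30 mins'],
    'Sim': [7, 7, 7],
})
base['Timestamp'] = pd.to_datetime(base['Timestamp'], format=fmt)
sim['Timestamp'] = pd.to_datetime(sim['Timestamp'], format=fmt)
base = base.sort_values('Timestamp')
sim = sim.sort_values('Timestamp')

# Nearest Timestamp, but only among base rows with the same Name and Time.
matched = pd.merge_asof(sim, base, on='Timestamp',
                        by=['Name', 'Time'], direction='nearest')
matched['Sim Score'] = matched['Sim'].between(
    matched['Po'] - matched['Ref'], matched['Po'] + matched['Ref'])
print(matched[['Timestamp', 'Name', 'Time', 'Sim', 'Po', 'Ref', 'Sim Score']])
```

The two rows stamped 2022-07-10 05:15:00 have the same Sim value but different scores, because the 'non-trial' row is compared against the 'non-trial' base record (range -2 to 4) rather than the 'trial' one (range 0 to 8).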
You can also do this with pandasql. Note, however, that you had better add a unique constraint on one of the columns (e.g. on some trial):
from pandasql import sqldf
df3 = sqldf('''
select df2.Timestamp, df2.Name, df2.Time, df2.Sim,
case
when Sim >= (df1.Po - df1.Ref) and Sim <= (df1.Po + df1.Ref) then 'True'
when Sim < (df1.Po - df1.Ref) or Sim > (df1.Po + df1.Ref) then 'False'
end as 'Sim Score'
from df1, df2
where df2.Name == df1.Name and df2.Time == df1.Time
''')
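As far as I can tell, the query above joins on Name and Time only, so each simulation row is paired with every base row that shares those values rather than with the nearest one; that is presumably why a unique constraint is recommended. A rough pandas equivalent of the join, using the example data from the question (the suffixes and the name `pairs` are my own), makes the row count visible:

```python
import pandas as pd

df1 = pd.DataFrame({
    'Timestamp': ['7/11/2022 11:30:00', '7/10/2022 04:00:00',
                  '7/09/2022 02:45:00', '6/28/2022 03:45:00'],
    'Name': ['trial'] * 4,
    'Time': ['20 mins'] * 4,
    'Po': [5, 4, 2, 3],
    'Ref': [2, 4, 2, 6],
})
df2 = pd.DataFrame({
    'Timestamp': ['7/10/2022 05:15:00', '7/11/2022 12:45:00',
                  '7/12/2022 03:30:00'],
    'Name': ['trial'] * 3,
    'Time': ['20 mins'] * 3,
    'Sim': [7, 4, 8],
})

# Equality join on Name and Time: every simulation row meets every base row
# with the same Name and Time, regardless of timestamp.
pairs = df2.merge(df1, on=['Name', 'Time'], suffixes=('_sim', '_base'))
pairs['Sim Score'] = pairs['Sim'].between(
    pairs['Po'] - pairs['Ref'], pairs['Po'] + pairs['Ref'])

print(len(pairs))  # 3 simulation rows x 4 base rows
print(pairs.groupby('Timestamp_sim').size())
```

Each simulation row therefore appears four times in the result, once per base record, so an extra step is still needed to keep only the nearest base record.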