
Compare two dataframes, then add a new column to one of them based on the other

I need to be able to compare two dataframes, one with one column and one with two columns, like this:

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(0, 99, size=5)

df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0, 100, 0.1)
df_2['Y'] = np.cos(df_2['X']) + 30

Now I want to compare df_1['A'] and df_2['X'] to find matching values, and then fill a second column in df_1 (df_1['B']) with the value from df_2['Y'] that corresponds to the matching df_2['X'] value. Does anyone have a solution?

If there isn't an exact match between the first columns of the two dataframes, is there a way to match the next closest value (with a threshold of ~5%)?

As mentioned in the OP, you may also want to capture the value closest to each entry of df_1['A'] when there is no exact match in df_2['X']. To do this, you can try the following:

Define your dfs as per the OP:

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(0, 99, size=5)

df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0, 100, 0.1)
df_2['Y'] = np.cos(df_2['X']) + 30

First, define a function that finds the closest value:

import numpy as np

def find_nearest(df, in_col, value, out_col):
    # df: frame to search (df_2 here); in_col: column to match against ('X' here)
    # value: value to look up (an entry of df_1['A'] here); out_col: column to return ('Y' here)
    array = np.asarray(df[in_col])
    idx = np.abs(array - value).argmin()  # position of the closest in_col value
    return df[out_col].iloc[idx]

Then get all the df_2['Y'] values you want:

matching_vals = []  # df_2['Y'] values to put in df_1['B']
for A in df_1['A'].values:  # loop through all df_1['A'] values
    # note: `A in df_2['X']` would test the index labels, not the column values
    exact = df_2.loc[df_2['X'] == A, 'Y']  # rows where X equals A exactly
    if not exact.empty:  # exact match: take the corresponding Y
        matching_vals.append(float(exact.iloc[0]))
    else:  # no exact match: take Y at the closest X
        matching_vals.append(find_nearest(df_2, 'X', A, 'Y'))

Finally, add it to the original df_1:

df_1['B']=matching_vals

This example works for the dfs that you have provided, but you may have to adjust the steps slightly to work with your real data.

You can also add one more if statement if you want to enforce the 5% threshold rule; if a value doesn't pass, just append NaN to the list (or whatever works best for you).
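For the threshold check, here is a rough sketch of what that extra if could look like. It assumes "5%" means the closest X may differ from the looked-up value by at most 5% of that value; the helper name nearest_within and its rel_tol parameter are made up for illustration, and the coarse df_2 below is stand-in data chosen so that one value fails the check:

```python
import numpy as np
import pandas as pd

def nearest_within(df, in_col, value, out_col, rel_tol=0.05):
    """Return df[out_col] at the row whose df[in_col] is closest to value,
    or NaN if that closest value is more than rel_tol * |value| away.
    (Hypothetical helper; the relative-tolerance reading of "5%" is an assumption.)"""
    keys = df[in_col].to_numpy()
    idx = np.abs(keys - value).argmin()  # position of the closest key
    if abs(keys[idx] - value) > rel_tol * abs(value):
        return np.nan  # closest key is outside the threshold
    return df[out_col].iloc[idx]

# Coarse stand-in lookup table (assumed data) so that some values miss
df_2 = pd.DataFrame({'X': [0.0, 10.0, 20.0, 30.0]})
df_2['Y'] = np.cos(df_2['X']) + 30

df_1 = pd.DataFrame({'A': [10.0, 19.5, 25.0]})
df_1['B'] = [nearest_within(df_2, 'X', a, 'Y') for a in df_1['A']]
print(df_1)
```

Here 10.0 matches exactly, 19.5 is within 5% of 20.0, and 25.0 is too far from both 20.0 and 30.0, so it gets NaN. Note that with a purely relative tolerance a lookup value of 0 only accepts an exact match; you may want an absolute floor for your real data.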

df_2.merge(df_1[['A']], left_on=['X'], right_on=['A']).rename({'Y': 'B'}, axis='columns')

The merge keeps only the values common to df_1['A'] and df_2['X'], and 'Y' is then renamed to 'B'.
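A plain merge only keeps exact matches. If you also want the closest-value behaviour without a Python loop, pd.merge_asof with direction='nearest' may be worth a try. This is only a sketch on stand-in data (the fixed values in df_1 are assumed; the question uses random integers), and it is a different technique from the merge above:

```python
import numpy as np
import pandas as pd

# Stand-in data (assumed values; the question uses random integers)
df_1 = pd.DataFrame({'A': [42, 3, 10]})
df_2 = pd.DataFrame({'X': np.arange(0, 100, 0.1)})
df_2['Y'] = np.cos(df_2['X']) + 30

# merge_asof needs both keys sorted and of the same dtype, so cast A to
# float and keep the original row labels in an 'index' column
left = df_1[['A']].astype('float64').reset_index().sort_values('A')
right = df_2.sort_values('X')

# For each A, pick the row of df_2 whose X is nearest
nearest = pd.merge_asof(left, right, left_on='A', right_on='X', direction='nearest')

# Align the matched Y values back to df_1 by its original index
df_1['B'] = nearest.set_index('index')['Y']
print(df_1)
```

merge_asof also takes a tolerance argument, but as far as I know it is an absolute distance on the key rather than a percentage, so a 5% rule would still need a separate check.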
