Compare two dataframes cell by cell with conditions on columns

Question

I want to compare two dataframes and output a dataframe showing their differences. However, I can tolerate a date difference of up to 2 days and a score difference of up to 5 points. If the values are within these acceptable ranges, I will keep the values from df1.

df1

id    group      date        score
10     A       2020-01-10     50
29     B       2020-01-01     80
39     C       2020-01-21     84
38     A       2020-02-02     29

df2

id    group      date        score
10     B       2020-01-11     56
29     B       2020-01-01     81
39     C       2020-01-22     85
38     A       2020-02-12     29

My expected output:

id    group           date                      score
10     A -> B       2020-01-10                50 -> 56
29     B            2020-01-01                   80
39     C            2020-01-21                   84
38     A            2020-02-02 -> 2020-02-12     29

Thus, I want to compare the dataframes cell by cell, with conditions on certain columns.

I started on this:

import pandas as pd

df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
result = []
for index, row in df1.iterrows():
    diff = []
    for col in df1.columns:
        # look up the matching cell in df2 by id instead of looping over df2
        old, new = row[col], df2.loc[index, col]
        if col == 'date':
            # acceptable if it's within 2 days difference
            same = abs(pd.Timestamp(new) - pd.Timestamp(old)) <= pd.Timedelta(days=2)
        elif col == 'score':
            # acceptable if it's within 5 points difference
            same = abs(new - old) <= 5
        else:
            same = old == new
        diff.append(old if same else '{} -> {}'.format(old, new))
    result.append(diff)
df = pd.DataFrame(result, index=df1.index, columns=df1.columns)

Answer 1

Let's try:

import pandas as pd

# assumes df1 and df2 are indexed by 'id' and 'date' is a datetime column
thresh = {'date':pd.to_timedelta('2D'),
          'score':5}

def update(col):
    name = col.name

    # if there is a threshold, we update only if threshold is surpassed
    if name in thresh:
        return col.where(col.sub(df2[name]).abs()<=thresh[name], df2[name])

    # there is no threshold for the column
    # return the corresponding column from df2
    return df2[name]

df1.apply(update)

Output:

   group       date  score
id                        
10     B 2020-01-10     56
29     B 2020-01-01     80
39     C 2020-01-21     84
38     A 2020-02-12     29
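The output above holds the merged values, not the "old -> new" strings from the question's expected output. A minimal sketch of one way to get that format follows, reusing the same threshold idea. The helper name show_diff and the rebuilt sample frames are assumptions for illustration, not part of the original answer, and the date column is assumed to be parsed as datetime.

```python
import pandas as pd

# Sample frames rebuilt from the question, indexed by id (assumption: dates are datetimes)
df1 = pd.DataFrame({
    'id': [10, 29, 39, 38],
    'group': ['A', 'B', 'C', 'A'],
    'date': pd.to_datetime(['2020-01-10', '2020-01-01', '2020-01-21', '2020-02-02']),
    'score': [50, 80, 84, 29],
}).set_index('id')
df2 = pd.DataFrame({
    'id': [10, 29, 39, 38],
    'group': ['B', 'B', 'C', 'A'],
    'date': pd.to_datetime(['2020-01-11', '2020-01-01', '2020-01-22', '2020-02-12']),
    'score': [56, 81, 85, 29],
}).set_index('id')

thresh = {'date': pd.Timedelta(days=2), 'score': 5}

def show_diff(col):
    """Hypothetical helper: keep df1's value if acceptable, else 'old -> new'."""
    other = df2[col.name]
    if col.name in thresh:
        # within tolerance counts as "same"
        same = (col - other).abs() <= thresh[col.name]
    else:
        # no tolerance for this column: require exact equality
        same = col == other
    if col.name == 'date':
        old, new = col.dt.strftime('%Y-%m-%d'), other.dt.strftime('%Y-%m-%d')
    else:
        old, new = col.astype(str), other.astype(str)
    # where() keeps `old` where `same` is True and uses the joined string elsewhere
    return old.where(same, old + ' -> ' + new)

report = df1.apply(show_diff)
print(report)
```

Every cell in report is a string, so this frame is suited to display rather than further arithmetic.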
