Comparing quarterly data: iterating in Python (Pandas) to compare multiple columns from four different Excel files imported as DataFrames

Question

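Dear Stack Overflow community, I have an Excel file "big_excel.xlsx" that consists of four columns: "date_column", "efficacy", "composition" and "testgroups". I have split this Excel file by quarter into "q1..q4" so that I can compare the values in each column against 4 different Excel files that I received from 4 different sources; the data should be 100% identical. Conveniently, the senders have already sorted the elements so that each file should match the corresponding quarterly split exactly. My code works perfectly for quarter q1. For the comparison I used ".equals" because the data can contain NaNs. Now I have to apply the same code concept to the remaining quarters q2..q4.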

import pandas as pd
from os.path import expanduser as ospath
import numpy as np


df = pd.read_excel(ospath('big_excel.xlsx'))
df.date_column = pd.to_datetime(df.date_column)
df['quarters'] = df.date_column.dt.quarter

q1 = df[df.quarters == 1]
q2 = df[df.quarters == 2].reset_index(drop=True)
q3 = df[df.quarters == 3].reset_index(drop=True)
q4 = df[df.quarters == 4].reset_index(drop=True)

test_excel_q1 = pd.read_excel(ospath('from_biontech.xlsx'))
test_excel_q2 = pd.read_excel(ospath('from_astrazeneca.xlsx'))
test_excel_q3 = pd.read_excel(ospath('from_sputnik.xlsx'))
test_excel_q4 = pd.read_excel(ospath('from_moderna.xlsx'))

q1['compare_date_column'] = np.where(q1[q1.columns[1]].equals(test_excel_q1[test_excel_q1.columns[1]]), 'True', 'False')  
q1['compare_efficacy'] = np.where(q1[q1.columns[2]].equals(test_excel_q1[test_excel_q1.columns[2]]), 'True', 'False')
q1['compare_composition'] = np.where(q1[q1.columns[3]].equals(test_excel_q1[test_excel_q1.columns[3]]), 'True', 'False')
q1['compare_testgroups'] = np.where(q1[q1.columns[4]].equals(test_excel_q1[test_excel_q1.columns[4]]), 'True', 'False')

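To do this, I could obviously change q1 to q2, q3, q4 in q1['compare_date_column'], q1['compare_efficacy'], q1['compare_composition'] and q1['compare_testgroups'] and then copy and paste. However, that is a dirty solution, and it will get confusing if I add more columns in the future. So I would like to know whether my problem can be solved by iteration.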

My idea: create a list of variables var_list = [q1,q2,q3,q4] and, for each index i in var_list, iteratively substitute the corresponding element into:

q1['compare_date_column'] = np.where(q1[q1.columns[1]].equals(test_excel_q1[test_excel_q1.columns[1]]), 'True', 'False')  
q1['compare_efficacy'] = np.where(q1[q1.columns[2]].equals(test_excel_q1[test_excel_q1.columns[2]]), 'True', 'False')
q1['compare_composition'] = np.where(q1[q1.columns[3]].equals(test_excel_q1[test_excel_q1.columns[3]]), 'True', 'False')
q1['compare_testgroups'] = np.where(q1[q1.columns[4]].equals(test_excel_q1[test_excel_q1.columns[4]]), 'True', 'False')

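Do I need to define a function for this? If so, could anyone help me, as I am still learning Python? I would be very grateful for any input. Thank you very much for your time and effort.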

Answer 1

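One approach is to define a function that takes a quarter's DataFrame and the corresponding test DataFrame for that quarter, and returns the original DataFrame with the comparison columns added. Something like: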

# You can also use this function to compare just one quarter
def compare_quarter(df_q: pd.DataFrame, df_test_q: pd.DataFrame) -> pd.DataFrame:
    # This does exactly the same as your four comparison lines:
    # equals() gives one result per column, which is then applied to every row
    df_q[[
        'compare_date_column',
        'compare_efficacy',
        'compare_composition',
        'compare_testgroups'
    ]] = [
        np.where(df_q.iloc[:, i].equals(df_test_q.iloc[:, i]), 'True', 'False')
        for i in range(1, 5)
    ]

    return df_q

Then you just iterate the function over the quarters:

for q, t in zip([q1, q2, q3, q4], [test_excel_q1, test_excel_q2, test_excel_q3, test_excel_q4]):
    compare_quarter(q, t)  # adds the comparison columns to q in place
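
Regarding the concern in the question about adding columns later: the comparison column names could be derived from the data instead of being hard-coded. The following is only a minimal sketch under assumptions. The helper name add_whole_column_flags and its skip parameter are hypothetical (not part of pandas or of the original code), the toy frames stand in for the Excel files, and both frames are assumed to use the same column names.

```python
import numpy as np
import pandas as pd


def add_whole_column_flags(df_q, df_test_q, skip=("quarters",)):
    """Hypothetical helper: return a copy of df_q with one
    'compare_<col>' column per column shared with df_test_q.

    Each flag is the string 'True' or 'False' for the whole column,
    mirroring the Series.equals approach used in the question.
    """
    out = df_q.reset_index(drop=True).copy()
    test = df_test_q.reset_index(drop=True)
    for col in df_q.columns:
        if col in skip or col not in test.columns:
            continue
        # Series.equals treats NaNs in the same position as equal
        same = out[col].equals(test[col])
        out[f"compare_{col}"] = "True" if same else "False"
    return out


# Toy data standing in for one quarter and its test file (assumption)
q = pd.DataFrame({
    "efficacy": [0.9, np.nan],
    "composition": ["a", "b"],
    "quarters": [1, 1],
})
t = pd.DataFrame({
    "efficacy": [0.9, np.nan],
    "composition": ["a", "c"],
})

result = add_whole_column_flags(q, t)
print(result)
```

Because the helper returns a new DataFrame rather than modifying its input, the results for several quarters can be collected with something like [add_whole_column_flags(a, b) for a, b in zip(quarters, tests)], where quarters and tests are lists of the corresponding frames.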

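Note: I noticed that when you compare each column, you are comparing the entire quarter column against the entire test column. This means that if even a single row differs, the whole compare column (all rows) will be False. If you want an element-wise comparison, use the eq method inside the function, for example: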

def compare_quartals(df_q: pd.DataFrame, df_test_q: pd.DataFrame) -> pd.DataFrame:
    comp_cols = [
        'compare_date_column',
        'compare_efficacy',
        'compare_composition',
        'compare_testgroups'
    ]

    for i in range(1, 5):
        # eq() compares row by row and returns booleans;
        # note that NaN compared with NaN gives False here
        df_q[comp_cols[i-1]] = df_q.iloc[:, i].eq(df_test_q.iloc[:, i])

    return df_q
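
Since the question mentions that the data can contain NaNs, it may be worth noting that eq reports NaN versus NaN as not equal, whereas equals treats NaNs in the same position as equal. A minimal sketch of one way to get an element-wise comparison that also counts matching NaNs; the helper name eq_with_nan is hypothetical, and the two Series are toy data:

```python
import numpy as np
import pandas as pd


def eq_with_nan(left, right):
    """Hypothetical helper: element-wise equality that also
    counts NaN vs. NaN in the same position as a match."""
    left = left.reset_index(drop=True)
    right = right.reset_index(drop=True)
    return left.eq(right) | (left.isna() & right.isna())


a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 4.0])

plain = a.eq(b)                # the NaN row is False
nan_aware = eq_with_nan(a, b)  # the NaN row is True

print(plain.tolist())      # [True, False, False]
print(nan_aware.tolist())  # [True, True, False]
```

Inside compare_quartals, the eq call could be swapped for such a helper if NaN-to-NaN matches should count as equal.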

Solution 1
Accepted 2021-06-12 14:26:35
