How to filter a dataframe multiple times in a loop with Pandas (multiple conditions and one-to-many dataframe results)

Question

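I'll show some code later; the problem is as follows:

I have a dataframe and a list containing some of that dataframe's columns. I need to get all the distinct values of those columns, store them, and create a separate dataframe from the original one for each combination of those distinct values. Then I export these dataframes to Excel (no problem there). For example:

Example table

This table will be converted to a dataframe; assume the list of columns is ['OS', 'Work']. In the end, I will have a dictionary with each column as a key and the set of that column's distinct values as its value, like this: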

data = {'OS': {'IOS', 'Linux', 'Windows'}, 'Work': {'Developer', 'CEO', 'Administrator', 'Engineer'}}

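Now comes the problem (and the code block I will show). I need to filter the dataframe by the combinations of those values, for example:

Dataframe 1) IOS + Developer ---> will contain only the rows that have IOS in the OS column and Developer in the Work column

Dataframe 2) IOS + CEO ---> will contain only the rows that have IOS in the OS column and CEO in the Work column

It is important to note that I don't know in advance which columns or which dataframe will be given as input. There can be any number of columns with any number of distinct values, and the algorithm should work in all cases.

This is the code I have so far: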

# data is the dictionary with the values as shown, it will automatically get all
# the columns and distinct values, for any number of columns and any dataframe

# column_name is the name of the column that I'm about to filter, and N is the condition
# (for example, df['OS'] == 'Linux' will only take rows that have Linux in that column)

for N in data:
    out = path + f'{name}({N})'
    df_aux = df[df[column_name] == N]
    with pandas.ExcelWriter(out) as writer:
        df_aux.to_excel(writer)  # export the filtered dataframe to an Excel .xlsx file

# this works for one column (working with a string and a set instead of a dictionary),
# but I have this (failure) for multiple columns

for col in data:
    for N in data[col]:
        #... and then filter with
        df_aux = df[df[col] == N]

#...and then export it to excel file in this level of indentation

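I have tried different indentation levels, a multidimensional array instead of a dictionary, an ordered dictionary, and so on. In the end I really don't know how to make the loop work, and that is the core problem. My current idea is to build a dataframe of the distinct column values and simply iterate over every possible combination in it, but I still don't know how to write the loop, because I don't know how to filter the original dataframe with an arbitrary number of conditions.

Thanks for your help; I know this is a somewhat lengthy question.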

Answer 1

This can be solved using the groupby function from pandas. A function that works on input data with arbitrary columns could look like this:

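import pandas as pd

def create_dataframes_by_columns(data, columns_to_group_by):
    # Collect one dataframe per combination of values that occurs in the data
    dataframes = []
    for name, group in data.groupby(columns_to_group_by):
        dataframes.append(group)

    # Distinct values of each grouping column, read from the `data` argument
    unique_values = {col: pd.unique(data[col]).tolist() for col in columns_to_group_by}

    return unique_values, dataframes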

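This returns two values: a dictionary of the unique values of the columns you grouped by, and a list of dataframes, each containing only the rows with one combination of values from columns_to_group_by. Only combinations that actually occur in the data produce a dataframe.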
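To make the two return values concrete, here is a minimal sketch on a small, made-up dataframe (the sample rows and the variable name sample are assumptions for illustration only). It restates the function so the snippet runs on its own, reading the distinct values from the data argument.

```python
import pandas as pd


def create_dataframes_by_columns(data, columns_to_group_by):
    # One dataframe per combination of values that occurs in the data
    dataframes = [group for _, group in data.groupby(columns_to_group_by)]
    # Distinct values of each grouping column, in order of first appearance
    unique_values = {col: pd.unique(data[col]).tolist() for col in columns_to_group_by}
    return unique_values, dataframes


# Hypothetical sample data, not taken from the question
sample = pd.DataFrame({
    'name': ['Maria', 'Ana', 'Gabriel', 'Marcos'],
    'work': ['Developer', 'Developer', 'CEO', 'Developer'],
    'OS': ['IOS', 'Linux', 'Linux', 'IOS'],
})

unique_values, dataframes = create_dataframes_by_columns(sample, ['work', 'OS'])

print(unique_values)    # distinct values per grouping column
print(len(dataframes))  # number of combinations that occur in the data
```

Here the sample contains three distinct (work, OS) pairs, so the list holds three dataframes, even though two work values times two OS values would give four possible pairs.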

If you want to save each dataframe to an Excel file, you can do the following (fully reproducible example):

import pandas as pd

df = pd.DataFrame({
    'name': [
        'Maria',
        'Ana',
        'Gabriel',
        'Marcos',
        'Ana',
        'Joaquin',
        'Alberto',
        'Maria',
        'Marta',
        'Belen'
    ],
    'work': [
        'Developer',
        'Administrator',
        'CEO',
        'Engineer',
        'Developer',
        'Developer',
        'Administrator',
        'CEO',
        'Developer',
        'Engineer'
    ],
    'OS': [
        'IOS',
        'Linux',
        'Linux',
        'Windows',
        'Linux',
        'Windows',
        'IOS',
        'IOS',
        'Windows',
        'Windows'
    ]
})
columns_to_group_by = ['work', 'OS']

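for name, group in df.groupby(columns_to_group_by):
    # name is a tuple with one value per grouping column, e.g. ('CEO', 'IOS')
    filename_parts = ['data']
    for value in name:
        filename_parts.append(str(value))  # str() so non-string values also work
    save_path = '_'.join(filename_parts) + '.xlsx'
    group.to_excel(save_path)  # writing .xlsx needs an Excel engine such as openpyxl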

The value "name" yielded by groupby is a tuple containing the unique values of the given group; I use these values to build the Excel file name.
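If you would rather keep the dictionary of distinct values from the question and loop over every combination yourself, one possible approach is to generate the combinations with itertools.product and AND together one boolean condition per column. This is only a sketch of an alternative to groupby, not part of the answer above; the helper name filter_by_combinations and the sample rows are hypothetical. Unlike groupby, product also yields combinations that match no rows, so the sketch skips empty results.

```python
import operator
from functools import reduce
from itertools import product

import pandas as pd


def filter_by_combinations(df, distinct_values):
    """Map each combination of values to the rows matching it (hypothetical helper)."""
    columns = list(distinct_values)
    results = {}
    for combo in product(*(distinct_values[col] for col in columns)):
        # Build one boolean Series per column and AND them all together,
        # starting from an all-True mask so any number of columns works
        conditions = (df[col] == value for col, value in zip(columns, combo))
        mask = reduce(operator.and_, conditions, pd.Series(True, index=df.index))
        filtered = df[mask]
        if not filtered.empty:  # skip combinations that match no rows
            results[combo] = filtered
    return results


# Hypothetical sample data
df = pd.DataFrame({
    'name': ['Maria', 'Ana', 'Gabriel', 'Marcos'],
    'Work': ['Developer', 'Administrator', 'CEO', 'Developer'],
    'OS': ['IOS', 'Linux', 'Linux', 'IOS'],
})

# Same shape as the dictionary in the question: column -> set of distinct values
data = {col: set(df[col]) for col in ['OS', 'Work']}

results = filter_by_combinations(df, data)
for combo, frame in results.items():
    print(combo, len(frame))
```

Each entry of results could then be written out with to_excel, using the combination tuple to build the file name as in the groupby loop. For large inputs groupby is likely the better fit, since it only visits combinations that exist in the data.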
