pandas: determine missing values in a row by computing across columns

Question

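I'm working with sales data. Below is a truncated version of the dataframe. The goal is to determine the actual value behind each “9999” and replace the cell.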

    CUSTOMER_ID  ALL_Sales_2017  Toyota_sales_2017  Honda_sales_2017  Ford_sales_2017  9999_count
    3000522      93              9999               70                20               1
    3000530      60              31                 9999              27               1
    3002817      231             9999               43                170              1
    3004201      18              6                  9999              9999             2
    3004573      36              9999               18                17               1
    3004888      9               9999               9999              9999             3

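In the dataset above, the “ALL_Sales_2017” column holds total sales and can be assumed to always contain a real value (never 9999). The individual columns (“Toyota_sales_2017”, “Honda_sales_2017”, “Ford_sales_2017”) may contain one, two, or three 9999 values.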

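The logic for determining the value behind a “9999” in a given row is:

For a row with one 9999, e.g. customer_id=3000522, the value is 93-(70+20)=3
For a row with two 9999s, e.g. customer_id=3004201, each value is (18-6)/2=6
For a row with three 9999s, ignore the row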
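
The three rules can be checked on single rows with a small plain-Python sketch. The names `SENTINEL` and `fill_row` are hypothetical, introduced only for illustration, and the sketch assumes that 9999 never occurs as a genuine sales figure.

```python
SENTINEL = 9999  # placeholder the question uses for an unknown value


def fill_row(total, values):
    """Apply the question's rules to the brand values of one row."""
    missing = [i for i, v in enumerate(values) if v == SENTINEL]
    # Rows with no placeholder, or with every brand missing, are left alone.
    if not missing or len(missing) == len(values):
        return list(values)
    known = sum(v for v in values if v != SENTINEL)
    # Split whatever the total does not explain evenly over the missing cells.
    share = (total - known) / len(missing)
    return [share if v == SENTINEL else v for v in values]


print(fill_row(93, [9999, 70, 20]))      # [3.0, 70, 20]
print(fill_row(18, [6, 9999, 9999]))     # [6, 6.0, 6.0]
print(fill_row(9, [9999, 9999, 9999]))   # unchanged
```

Treating the one-9999 case as "divide the remainder by one" means both rules collapse into the same formula.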

So, after processing, the dataset should look like this:

    CUSTOMER_ID  ALL_Sales_2017  Toyota_sales_2017  Honda_sales_2017  Ford_sales_2017  9999_count
    3000522      93              3                  70                20               1
    3000530      60              31                 2                 27               1
    3002817      231             18                 43                170              1
    3004201      18              6                  6                 6                2
    3004573      36              1                  18                17               1
    3004888      9               9999               9999              9999             3

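I came up with this implementation.

Create a new column (9999_count) to track the number of 9999s in each row.
Compute and assign the values in the appropriate columns.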
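
For reference, the counting step could be sketched as follows. The sample frame is rebuilt from the table above so the snippet runs on its own, and `brand_cols` is a hypothetical helper name.

```python
import pandas as pd

# Sample data copied from the truncated table in the question.
df = pd.DataFrame(
    {
        "CUSTOMER_ID": [3000522, 3000530, 3002817, 3004201, 3004573, 3004888],
        "ALL_Sales_2017": [93, 60, 231, 18, 36, 9],
        "Toyota_sales_2017": [9999, 31, 9999, 6, 9999, 9999],
        "Honda_sales_2017": [70, 9999, 43, 9999, 18, 9999],
        "Ford_sales_2017": [20, 27, 170, 9999, 17, 9999],
    }
)

brand_cols = ["Toyota_sales_2017", "Honda_sales_2017", "Ford_sales_2017"]

# eq(9999) gives a boolean frame; summing across axis=1 counts the True cells per row.
df["9999_count"] = df[brand_cols].eq(9999).sum(axis=1)
print(df)
```

Restricting the comparison to the brand columns keeps CUSTOMER_ID and the total out of the count.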

df.loc[
    df["Toyota_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Toyota_sales_2017"
] = (df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Ford_sales_2017"])
df.loc[
    df["Honda_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Honda_sales_2017"
] = (df["ALL_Sales_2017"] - df["Toyota_sales_2017"] - df["Ford_sales_2017"])
df.loc[df["Ford_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Ford_sales_2017"] = (
    df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Toyota_sales_2017"]
)

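How can this logic be extended to more columns, such as 2018, 2019, 2020, and so on? Can the logic be rewritten in a generic way? Is there another, possibly simpler, way to solve this problem?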

Answer 1

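You didn't provide a reproducible example, so I couldn't check this myself, but here is one way to generalize it that should work, using Python f-strings and pandas concat: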

import pandas as pd

dfs = []
for year in range(2017, 2022):  # years 2017 to 2021

    # Get the subset of df for the given year (copy so later assignments
    # do not trigger SettingWithCopyWarning)
    tmp = df.loc[
        :, ["CUSTOMER_ID"] + [col for col in df.columns if str(year) in col]
    ].copy()

    # Create a temporary column (9999_count) to track the number of 9999s in each row
    # (vectorized; CUSTOMER_ID is excluded so an ID can never be counted)
    tmp["9999_count"] = tmp.drop(columns="CUSTOMER_ID").eq(9999).sum(axis=1)

    # Change values
    tmp.loc[
        tmp[f"Toyota_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Toyota_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Honda_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Honda_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Ford_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Ford_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
    )

    dfs.append(tmp.drop(columns="9999_count"))

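# Get back the full dataframe; every yearly subset carries CUSTOMER_ID,
# so keep only the first copy of each repeated column
df = pd.concat(dfs, axis=1)
df = df.loc[:, ~df.columns.duplicated()]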
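
The loop above names each brand explicitly and only covers rows with a single 9999. A possible variation that also handles the two-9999 rule and picks up brand columns by name is sketched below. It assumes brand columns end in `_sales_<year>` and the total column is `ALL_Sales_<year>`, as in the question; `SENTINEL` and `fill_sentinels` are hypothetical names, and years without matching columns are skipped.

```python
import pandas as pd

SENTINEL = 9999  # assumed never to be a genuine sales figure


def fill_sentinels(df, years, total_prefix="ALL_Sales_"):
    """Replace 9999 placeholders year by year, following the question's rules."""
    out = df.copy()
    for year in years:
        total_col = f"{total_prefix}{year}"
        brand_cols = [
            c for c in out.columns
            if c.endswith(f"_sales_{year}") and c != total_col
        ]
        if total_col not in out.columns or not brand_cols:
            continue  # no data for this year

        is_missing = out[brand_cols].eq(SENTINEL)
        n_missing = is_missing.sum(axis=1)
        # Sum of the known brand values (placeholders masked to NaN are skipped).
        known_sum = out[brand_cols].mask(is_missing).sum(axis=1)
        # Remainder split evenly over the missing cells; NaN where nothing is missing.
        share = (out[total_col] - known_sum) / n_missing.where(n_missing > 0)
        # Rows where every brand is missing are ignored.
        fixable = n_missing.between(1, len(brand_cols) - 1)

        for col in brand_cols:
            replace = fixable & is_missing[col]
            out[col] = out[col].where(~replace, share)
    return out


df = pd.DataFrame(
    {
        "CUSTOMER_ID": [3000522, 3000530, 3002817, 3004201, 3004573, 3004888],
        "ALL_Sales_2017": [93, 60, 231, 18, 36, 9],
        "Toyota_sales_2017": [9999, 31, 9999, 6, 9999, 9999],
        "Honda_sales_2017": [70, 9999, 43, 9999, 18, 9999],
        "Ford_sales_2017": [20, 27, 170, 9999, 17, 9999],
    }
)
result = fill_sentinels(df, years=range(2017, 2022))
print(result)
```

The filled brand columns come back as floats, because an even split is not guaranteed to be a whole number; cast them back with `astype(int)` only if the data is known to divide cleanly.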
