Can this function be optimized to run faster?

Question

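I have a DataFrame with 3 million rows. I need to transform the values in one column. The column contains strings joined together with ";". The transformation splits each string into its components and then selects one of them according to some priority rules.

Here is a sample dataset and the function: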

import pandas as pd

data = {'Name': ['X1', 'X2', 'X3', 'X4', 'X5','X6'], 'category': ['CatA;CatB', 'CatB', None, 'CatB;CatC;CatA', 'CatA;CatB', 'CatB;CatD;CatB;CatC;CatA']}

sample_dataframe = pd.DataFrame(data)

def cat_name(x):
    if x:
        x =  pd.Series(x.split(";"))
        y = x[(x!='CatA') & x.notna()]
        custom_dict = {'CatC': 0, 'CatD':1, 'CatB': 2, 'CatE': 3}
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more'
            else:
                return y.iloc[0]+'+'
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None

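I am running the function on the column with the apply method: test_data = sample_dataframe['category'].apply(cat_name). On my 3-million-row dataset, the function takes almost 10 minutes to run.

How can I optimize the function to run faster?

Also, I have two sets of category rules, and the output categories need to be stored in two columns. Currently I call apply twice. A bit silly and slow, I know, but it works.

Is there a way to run the function for the different priority dictionaries at the same time and return two output values? I tried using test_data['CAT_NAME'], test_data['MAIN_CAT_NAME'] = zip(*sample_dataframe['category'].apply(joint_cat_name)) with this function: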

def joint_cat_name(x):
    cat_string = x
    if cat_string:
        string_series =  pd.Series(cat_string.split(";"))
        y = string_series[(string_series!='CatA') & string_series.notna()]
        custom_dict = {'CatB': 0, 'CatC':1, 'CatD': 2, 'CatE': 3}
        if string_series.count() == 1:
            return string_series.iloc[0], string_series.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more', y.iloc[0]
            elif y.count() == 1:
                return y.iloc[0]+'+', y.iloc[0]
        elif y.count() == 1:
            return y.iloc[0], y.iloc[0]
        else:
            return None, None
    else:
        return None, None

However, when zip encounters a tuple containing Nones, I get the error TypeError: 'NoneType' object is not iterable. That is, the error is raised when the output is (None, None).
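A note on the likely cause, with a sketch of one possible fix. A (None, None) tuple unpacks without trouble; what zip cannot iterate is a bare None. In joint_cat_name as posted, a row with exactly two non-CatA categories enters the `elif y.count() > 1` branch, matches neither inner condition, and falls through to an implicit `return None`. The version below is an assumption about the intended behavior (the two-category case returns the top-priority category with a "+"), written with plain lists so that every path returns a 2-tuple. Sorting unknown categories last via `custom_dict.get` is also an assumption.

```python
import pandas as pd

# Priority order copied from the question's joint_cat_name; a lower value wins.
custom_dict = {"CatB": 0, "CatC": 1, "CatD": 2, "CatE": 3}

def joint_cat_name(x):
    """Return (label, main_category); every path returns a 2-tuple."""
    if pd.isna(x) or not x:
        return None, None
    parts = x.split(";")
    if len(parts) == 1:
        return parts[0], parts[0]
    # CatA is ignored whenever other categories are present.
    ys = [p for p in parts if p != "CatA"]
    if not ys:
        return None, None
    # Assumption: categories missing from custom_dict sort last.
    main = min(ys, key=lambda k: custom_dict.get(k, len(custom_dict)))
    if len(ys) == 1:
        return main, main
    if len(ys) == 2:
        return main + "+", main
    return "3 or more", main

sample_dataframe = pd.DataFrame({
    "Name": ["X1", "X2", "X3", "X4", "X5", "X6"],
    "category": ["CatA;CatB", "CatB", None, "CatB;CatC;CatA",
                 "CatA;CatB", "CatB;CatD;CatB;CatC;CatA"],
})

# A list comprehension keeps the results as plain tuples for zip to unpack.
pairs = [joint_cat_name(v) for v in sample_dataframe["category"]]
labels, mains = zip(*pairs)
sample_dataframe["CAT_NAME"] = list(labels)
sample_dataframe["MAIN_CAT_NAME"] = list(mains)
```

Both output columns come from a single pass over the data, so the function only needs to be called once per row.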

Thanks a lot in advance.

Answer 1

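Your function does a lot of unnecessary work. Even just reordering some of the conditions would make it run faster. The version below also works on plain Python lists instead of building a pandas Series for every row.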

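# Priority order from the question: a lower value wins.
custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name(x):
    # Missing or empty input has no category.
    if not x:
        return None
    xs = x.split(";")
    # A single category is returned as-is, even if it is CatA.
    if len(xs) == 1:
        return xs[0]
    # CatA is ignored whenever other categories are present.
    ys = [c for c in xs if c != "CatA"]
    n = len(ys)
    if n == 0:
        return None
    if n == 1:
        return ys[0]
    if n == 2:
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    return "3 or more"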
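
Before swapping functions on 3 million rows, it may be worth checking that the rewrite agrees with the original. The sketch below copies the question's function under the hypothetical name cat_name_original and the list-based rewrite as cat_name_fast, then compares them on the sample values. The timing loop is only a rough indication on six rows, not a benchmark.

```python
import timeit

import pandas as pd

values = ["CatA;CatB", "CatB", None, "CatB;CatC;CatA",
          "CatA;CatB", "CatB;CatD;CatB;CatC;CatA"]

custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name_original(x):
    # The question's version: builds a pandas Series for every row.
    if x:
        x = pd.Series(x.split(";"))
        y = x[(x != "CatA") & x.notna()]
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda s: s.map(custom_dict))
            if y.count() > 2:
                return "3 or more"
            else:
                return y.iloc[0] + "+"
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None

def cat_name_fast(x):
    # The list-based rewrite from this answer.
    if x is None:
        return x
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0]
    ys = [c for c in xs if c != "CatA"]
    if len(ys) == 0:
        return None
    if len(ys) == 1:
        return ys[0]
    if len(ys) == 2:
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    return "3 or more"

expected = [cat_name_original(v) for v in values]
actual = [cat_name_fast(v) for v in values]
print("results match:", expected == actual)

# Rough timing only; absolute numbers depend on the machine.
for fn in (cat_name_original, cat_name_fast):
    seconds = timeit.timeit(lambda: [fn(v) for v in values], number=200)
    print(f"{fn.__name__}: {seconds:.4f} s for 200 passes over the sample")
```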

Answer 2

Rather than running a Python function on every row, it may be faster to pass over the DataFrame several times, each time using an optimized pandas operation. You would have to rewrite your code along these lines:

category = sample_dataframe["category"]

# rows with no category at all
no_cat = category.isna()

# number of categories per row, and rows holding exactly one category
num_cats = category.str.count(";") + 1
single_cat = ~no_cat & (num_cats == 1)

# count whole "CatA" tokens, so that a name such as "CatAB" does not match
num_cat_a = category.str.count(r"(?:^|;)CatA(?=;|$)")
has_cat_A = num_cat_a > 0

# "3 or more" is decided on the categories left after dropping CatA
num_non_a = num_cats - num_cat_a
three_or_more = ~single_cat & (num_non_a > 2)

# write the rows that are already decided; .loc avoids chained assignment
sample_dataframe["cat_name"] = None
sample_dataframe.loc[single_cat, "cat_name"] = category[single_cat]
sample_dataframe.loc[three_or_more, "cat_name"] = "3 or more"

# continue with however complex you want to get to cover more cases, e.g.
two_cats_no_cat_A = (num_cats == 2) & ~has_cat_A

# then handle only the remaining cases with the apply
not_handled = ~no_cat & ~single_cat & ~three_or_more
sample_dataframe.loc[not_handled, "cat_name"] = category[not_handled].apply(cat_name)

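Running these queries on 3 million rows should be very fast, even if you have to run several of them and combine the results. If it is still too slow, you can handle more of the special cases currently left to apply in the same vectorized way.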
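
A related way to call the Python function less often, sketched here under an assumption the question does not confirm: that the 3 million rows contain far fewer distinct category strings than rows. In that case the rule can be evaluated once per distinct string and looked up for every row with `Series.map`. The names `category` and `lookup` are illustrative, and `cat_name` is the list-based function from Answer 1, restricted to non-missing input.

```python
import pandas as pd

custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name(x):
    # Plain-Python rule from Answer 1, applied to one non-missing string.
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0]
    ys = [c for c in xs if c != "CatA"]
    if not ys:
        return None
    if len(ys) == 1:
        return ys[0]
    if len(ys) == 2:
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    return "3 or more"

category = pd.Series(["CatA;CatB", "CatB", None, "CatB;CatC;CatA",
                      "CatA;CatB", "CatB;CatD;CatB;CatC;CatA"])

# Evaluate the rule once per distinct string, then look each row up in the table.
lookup = {value: cat_name(value) for value in category.dropna().unique()}

# Rows with no category are absent from the table and come out as missing.
cat_name_col = category.map(lookup)
```

If nearly every row holds a different string, this saves nothing over a plain apply, so it is worth checking `category.nunique()` first.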

Answer 1
1 2022-09-27 12:21:43

Answer 2
0 2022-09-27 13:03:50
