Can this function be optimized to run faster?

Question

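I have a dataframe with 3 million rows. I need to transform the values in one column. The column contains strings joined together with ";". The transformation involves splitting each string into its components and then picking one of them according to some priority rules.

Here is a sample dataset and the function: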

import pandas as pd

data = {'Name': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'], 'category': ['CatA;CatB', 'CatB', None, 'CatB;CatC;CatA', 'CatA;CatB', 'CatB;CatD;CatB;CatC;CatA']}

sample_dataframe = pd.DataFrame(data)

def cat_name(x):
    if x:
        x =  pd.Series(x.split(";"))
        y = x[(x!='CatA') & x.notna()]
        custom_dict = {'CatC': 0, 'CatD':1, 'CatB': 2, 'CatE': 3}
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more'
            else:
                return y.iloc[0]+'+'
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None

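I run the function on the column with apply: test_data = sample_dataframe['category'].apply(cat_name). On my 3-million-row dataset, this takes almost 10 minutes.

How can I optimize the function so it runs faster?

Also, I have two sets of category rules, and the output categories need to be stored in two columns. Currently I apply the function twice. A bit silly and slow, I know, but it works.

Is there a way to run the function for the different priority dictionaries at the same time and return two output values? I tried using test_data['CAT_NAME'], test_data['MAIN_CAT_NAME'] = zip(*sample_dataframe['category'].apply(joint_cat_name)) with this function: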

def joint_cat_name(x):
    cat_string = x
    if cat_string:
        string_series =  pd.Series(cat_string.split(";"))
        y = string_series[(string_series!='CatA') & string_series.notna()]
        custom_dict = {'CatB': 0, 'CatC':1, 'CatD': 2, 'CatE': 3}
        if string_series.count() == 1:
            return string_series.iloc[0], string_series.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more', y.iloc[0]
            elif y.count() == 1:
                return y.iloc[0]+'+', y.iloc[0]
        elif y.count() == 1:
            return y.iloc[0], y.iloc[0]
        else:
            return None, None
    else:
        return None, None

But when zip encounters a tuple containing Nones, I get the error TypeError: 'NoneType' object is not iterable. That is, the error was raised when the output was (None, None).

Thanks very much in advance.

Answer 1

Your function does a lot of unnecessary work. It will run faster even if you only reorder some of the conditions.

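custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name(x):
    # Missing value: nothing to split.
    if x is None:
        return None
    xs = x.split(";")
    # A single category is returned unchanged, even if it is CatA.
    if len(xs) == 1:
        return xs[0]
    # Drop CatA before applying the priority rules.
    ys = [c for c in xs if c != "CatA"]
    n = len(ys)
    if n == 0:
        return None
    if n == 1:
        return ys[0]
    if n == 2:
        # The lowest value in custom_dict has the highest priority.
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    return "3 or more"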
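
On the second part of the question, here is a sketch (not from the original answer) of how the same list-based approach could return both values in one pass. It follows the single priority dictionary used in the question's joint_cat_name; treating unknown categories as lowest priority and the small `sample` frame are assumptions made for illustration. A likely cause of the TypeError, by the way, is not the (None, None) tuples: in the question's version, when exactly two categories remain after dropping CatA, the inner `elif y.count() == 1` can never be true, so the function falls through and returns a bare None, which zip cannot unpack.

```python
import pandas as pd

# Priority dictionary taken from the question's joint_cat_name.
custom_dict = {"CatB": 0, "CatC": 1, "CatD": 2, "CatE": 3}

def joint_cat_name(x):
    # Anything that is not a string (None, NaN) counts as missing.
    if not isinstance(x, str):
        return None, None
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0], xs[0]
    ys = [c for c in xs if c != "CatA"]
    if not ys:
        return None, None
    if len(ys) == 1:
        return ys[0], ys[0]
    # Assumption: categories missing from custom_dict sort last.
    top = min(ys, key=lambda c: custom_dict.get(c, float("inf")))
    if len(ys) == 2:
        return top + "+", top
    return "3 or more", top

# Every branch returns a 2-tuple, so the pairs can go straight into two columns.
sample = pd.DataFrame({"category": ["CatA;CatB", "CatB", None, "CatB;CatC;CatA"]})
pairs = [joint_cat_name(v) for v in sample["category"]]
out = pd.DataFrame(pairs, columns=["CAT_NAME", "MAIN_CAT_NAME"], index=sample.index)
```

Building the two columns from a list of tuples avoids the zip(*...) unpacking entirely. If the two columns really need different priority rules, a second dictionary could be used for the second element of each tuple.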

Answer 2

Rather than running one Python function on every row, it may be faster to make several passes over your dataframe, each time using an optimized pandas operation. You would have to rewrite your code like this:

cat = sample_dataframe["category"]

# select empty categories
no_cat = cat.isna()

# select category strings with only one category
single_cat = ~no_cat & (cat.str.count(";") == 0)

# get the number of categories, and how many of them are not "CatA"
num_cats = cat.str.count(";") + 1
num_non_a = num_cats - cat.str.count(r"(?:^|;)CatA(?=;|$)")
# cat_name only returns "3 or more" when more than two non-CatA categories remain
three_or_more = num_non_a > 2

# has a "CatA" category
has_cat_A = cat.str.contains("CatA", na=False)

# then write the selected rows; .loc avoids chained assignment on a column view
sample_dataframe["cat_name"] = ""
sample_dataframe.loc[no_cat, "cat_name"] = None
sample_dataframe.loc[single_cat, "cat_name"] = cat[single_cat]
sample_dataframe.loc[three_or_more, "cat_name"] = "3 or more"

# continue with however complex you want to get to cover more cases, e.g.
two_cats_no_cat_A = (num_cats == 2) & ~has_cat_A

# then handle only the remaining cases with the apply (cat_name as defined in Answer 1)
not_handled = ~no_cat & ~single_cat & ~three_or_more
sample_dataframe.loc[not_handled, "cat_name"] = cat[not_handled].apply(cat_name)

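Running these queries on 3 million rows should be very fast, even if you have to run several of them and combine the results. If it is still too slow, you can handle more of the special cases from the apply in the same vectorized way.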
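
As a rough sketch of where "more cases in the same vectorized way" could end up, the remaining rows can be handled without any apply by exploding the split strings into one row per category. The function name `cat_name_vectorized` is hypothetical, the code assumes the Series has a unique index, and it has not been benchmarked, so whether it actually beats a plain-Python function on 3 million rows would need to be measured.

```python
import numpy as np
import pandas as pd

custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name_vectorized(cat):
    # One row per category; the original index label is repeated for each piece.
    parts = cat.str.split(";").explode().dropna()
    n_total = parts.groupby(level=0).size().reindex(cat.index, fill_value=0)

    # Drop CatA and count what is left per original row.
    non_a = parts[parts != "CatA"]
    n_non_a = non_a.groupby(level=0).size().reindex(cat.index, fill_value=0)

    # Order all pieces by priority (unknown categories map to NaN and sort last),
    # then keep the first piece of each original row.
    order = np.argsort(non_a.map(custom_dict).to_numpy(), kind="stable")
    best = non_a.iloc[order].groupby(level=0).first()
    best = best.reindex(cat.index).astype(object)

    result = pd.Series([None] * len(cat), index=cat.index, dtype=object)
    one = n_non_a == 1
    two = n_non_a == 2
    result[one] = best[one]
    result[two] = best[two] + "+"
    result[n_non_a > 2] = "3 or more"
    # A lone category is returned unchanged, even if it is CatA.
    single = n_total == 1
    result[single] = cat[single]
    return result

categories = pd.Series(
    ["CatA;CatB", "CatB", None, "CatB;CatC;CatA", "CatA;CatB", "CatB;CatD;CatB;CatC;CatA"]
)
vectorized_result = cat_name_vectorized(categories)
```

The trade-off is memory: the exploded Series holds one row per category rather than one per input row, which may matter at this scale.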

Answer 1
1 2022-09-27 12:21:43

Answer 2
0 2022-09-27 13:03:50