Applying a vectorized function with multiple return values to a Pandas DataFrame

Question

I have a dataframe with a column holding 'Log' strings. I'd like to create a new column based on the values I've parsed from the 'Log' column. Currently, I'm using .apply() with the following function:

def classification(row):
    if 'A' in row['Log']:
        return 'Situation A'
    elif 'B' in row['Log']:
        return 'Situation B'
    elif 'C' in row['Log']:
        return 'Situation C'
    return 'Check'

It is called like this:

df['Classification'] = df.apply(classification, axis=1)

The issue is that it takes a long time (~3 min for a dataframe with 4M rows), and I'm looking for a faster way. I saw some examples of users using vectorized functions that run much faster, but those don't have if statements in the function. My question: is it possible to vectorize the function I've added, and what is the fastest way to perform this task?

Answer 1

I would not be sure that using a nested numpy.where will increase performance. Here are some performance tests with 4M rows:

import numpy as np
import pandas as pd

ls = ['Abc', 'Bert', 'Colv', 'Dia']
df =  pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})

# Nested np.where: the first matching condition wins (A, then B, then C).
df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                      np.where(df['Log'].str.contains('B'), 'Situation B',
                          np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))


def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


df['Log_apply'] = df['Log'].apply(classification)

Nested np.where performance

%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B', np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Series.apply performance

%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At least on my machine, the nested np.where is almost 10x slower than Series.apply.
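To reproduce this comparison outside IPython, here is a minimal sketch that uses the standard-library timeit module instead of %timeit. The helper name nested_where, the 100,000-row sample size, and the fixed random seed are assumptions made for illustration; they are not part of the original test, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def classification(x):
    # Same priority order as in the question: A, then B, then C.
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


def nested_where(log):
    # Hypothetical helper wrapping the nested np.where expression.
    return np.where(log.str.contains('A'), 'Situation A',
                    np.where(log.str.contains('B'), 'Situation B',
                             np.where(log.str.contains('C'), 'Situation C', 'Check')))


# Assumption: a smaller sample than the 4M rows above, so the sketch runs quickly.
rng = np.random.default_rng(0)
sample = pd.DataFrame({'Log': rng.choice(['Abc', 'Bert', 'Colv', 'Dia'], 100_000)})

# Sanity check: both approaches should produce the same labels.
same_labels = (nested_where(sample['Log']).tolist()
               == sample['Log'].apply(classification).tolist())

# Take the best of three single runs for each approach.
t_where = min(timeit.repeat(lambda: nested_where(sample['Log']), number=1, repeat=3))
t_apply = min(timeit.repeat(lambda: sample['Log'].apply(classification), number=1, repeat=3))

print(f'labels agree: {same_labels}')
print(f'nested np.where: {t_where:.3f} s, Series.apply: {t_apply:.3f} s')
```

Checking that the labels agree before timing guards against comparing two approaches that do not compute the same thing.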

A final remark: using the solution suggested in the comments, i.e. something like:

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')

has the following problems:

It won't necessarily be faster. Testing on my machine:

%timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The .extract method follows string order, i.e. from the string 'AB' it will extract 'A' and from 'BA' it will extract 'B'. The OP's classification function, on the other hand, applies a hierarchical order (A before B before C), so it returns 'Situation A' in both cases.
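The ordering difference can be seen on a few hand-picked strings. The sketch below compares the extract-and-map approach with the OP's function, and also shows numpy.select as one possible flat alternative that keeps the A, B, C priority, since np.select uses the first condition that is True. The sample strings and the variable names (logs, labels, by_extract, by_apply, by_select) are made up for illustration, and no claim is made here about how fast np.select is.

```python
import numpy as np
import pandas as pd


def classification(x):
    # Priority order from the question: A, then B, then C.
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


# Hypothetical sample strings chosen to expose the ordering difference.
logs = pd.Series(['AB', 'BA', 'CB', 'xyz'])
labels = {'A': 'Situation A', 'B': 'Situation B', 'C': 'Situation C'}

# extract returns the first match by position in the string.
by_extract = logs.str.extract('(A|B|C)')[0].map(labels).fillna('Check')

# The OP's function checks for A first, then B, then C.
by_apply = logs.apply(classification)

# np.select picks the first True condition, so the dict order sets the priority.
conditions = [logs.str.contains(key, regex=False).to_numpy(dtype=bool) for key in labels]
by_select = pd.Series(np.select(conditions, list(labels.values()), default='Check'),
                      index=logs.index)

print(pd.DataFrame({'Log': logs, 'extract': by_extract,
                    'apply': by_apply, 'select': by_select}))
```

On this sample, the extract column differs from the apply column for 'BA' and 'CB', while the select column matches apply on every row.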

Answer 1
2 2020-01-13 19:17:37