
Apply / vectorize / speed up a column-wise cleanup function on a pandas DataFrame

I have some data pipeline code that applies transformation / cleanup logic to the columns of a Pandas dataframe based on their names.

Right now I'm iterating over the columns using df.iteritems(), which, according to this guide on optimizing Pandas apply functions, is better than crude looping but is "the least efficient way to run most standard functions".

I'd like to improve the performance of this code, either by taking advantage of Pandas's ability to vectorize these operations or through some other parallel approach.

All of the worked examples I have seen illustrate how to do this row-wise (e.g., compute on a Series instead of on a single row), but I haven't been able to find a good example of how to do this column-wise.

Here is a reproducible / toy example using the Boston dataset from scikit-learn. The desired outcome is to implement the cleaning logic in a vectorized / parallel manner (without using .iteritems() or looping). Thanks!

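from typing import Callable

import pandas as pd

# sample df from sklearn
# (load_boston was removed in scikit-learn 1.2, so this needs an older release)
from sklearn import datasets
boston = datasets.load_boston()
boston = pd.DataFrame(boston.data, columns=boston.feature_names)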
boston.head()

def double_it(col: pd.Series) -> pd.Series:
    return col.multiply(2)

def make_string(col: pd.Series) -> pd.Series:
    return col.astype(str)

def do_nothing(col: pd.Series) -> pd.Series:
    return col

def match_cleaner(col_name: str) -> Callable:
    if col_name in ['ZN', 'NOX', 'INDUS', 'AGE']:
        return double_it
    elif col_name in ['TAX', 'DIS', 'CHAS', 'PTRATIO']:
        return make_string
    else:
        print(col_name)
        return do_nothing

# DataFrame.iteritems() was removed in pandas 2.0; newer versions use .items()
for key, value in boston.iteritems():
    cleaning_func = match_cleaner(key)
    boston.loc[:, key] = cleaning_func(value)

# confirm changes
boston.head()
print(boston.dtypes)

You could use pandas.DataFrame.apply. By default, the apply method calls the provided function once per column of the dataframe, passing each column in as a Series. You would need to modify your match_cleaner function a bit, so that it reads the column name from the Series and returns the cleaned column rather than a function.

def match_cleaner2(col):
    col_name = col.name
    if col_name in ['ZN', 'NOX', 'INDUS', 'AGE']:
        return double_it(col)
    elif col_name in ['TAX', 'DIS', 'CHAS', 'PTRATIO']:
        return make_string(col)
    else:
        return do_nothing(col)

b2 = boston.apply(match_cleaner2)
b2.head()
      CRIM             ZN          INDUS  ...   PTRATIO       B  LSTAT
0  0.00632  3.932955e+246  5.047292e+245  ...      15.3  396.90   4.98
1  0.02731   0.000000e+00  1.544777e+246  ...      17.8  396.90   9.14
2  0.02729   0.000000e+00  1.544777e+246  ...      17.8  392.83   4.03
3  0.03237   0.000000e+00  4.763245e+245  ...      18.7  394.63   2.94
4  0.06905   0.000000e+00  4.763245e+245  ...      18.7  396.90   5.33

%timeit boston.apply(match_cleaner2)
3.68 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

def original():
    for k, v in boston.iteritems():
        clean_f = match_cleaner(k)
        boston.loc[:, k] = clean_f(v)

original()
boston.head()
      CRIM             ZN          INDUS  ...   PTRATIO       B  LSTAT
0  0.00632  3.932955e+246  5.047292e+245  ...      15.3  396.90   4.98
1  0.02731   0.000000e+00  1.544777e+246  ...      17.8  396.90   9.14
2  0.02729   0.000000e+00  1.544777e+246  ...      17.8  392.83   4.03
3  0.03237   0.000000e+00  4.763245e+245  ...      18.7  394.63   2.94
4  0.06905   0.000000e+00  4.763245e+245  ...      18.7  396.90   5.33


pd.testing.assert_frame_equal(b2, boston) # boston was modified in place

# No AssertionError means frames are equal

%timeit original()
6.14 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So from a very rough experiment, apply looks faster: about 3.68 ms per loop versus 6.14 ms for the iteritems() loop, i.e. roughly 40% less time.
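As a complementary sketch (not part of the answer above, and not benchmarked here), the per-column dispatch can be avoided entirely by transforming each group of columns in a single operation. The frame below is a synthetic stand-in that only borrows some of the Boston column names, and DOUBLE_COLS, STRING_COLS and clean_blockwise are hypothetical names introduced for illustration. It assumes every listed column is present in the frame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Boston frame (assumption: only the column names matter here).
rng = np.random.default_rng(0)
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "AGE", "DIS", "TAX", "PTRATIO"]
df = pd.DataFrame(rng.random((5, len(columns))), columns=columns)

# Hypothetical groupings, mirroring the lists in match_cleaner.
DOUBLE_COLS = ["ZN", "NOX", "INDUS", "AGE"]
STRING_COLS = ["TAX", "DIS", "CHAS", "PTRATIO"]


def clean_blockwise(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy, handling each group of columns in one operation."""
    out = frame.copy()
    # One arithmetic operation over the whole block of numeric columns.
    out[DOUBLE_COLS] = out[DOUBLE_COLS] * 2
    # One astype call with a per-column mapping for the string conversions.
    out = out.astype({col: str for col in STRING_COLS})
    return out


cleaned = clean_blockwise(df)
print(cleaned.dtypes)
```

Columns that appear in neither list (CRIM here) pass through unchanged, which plays the role of do_nothing. Because the function works on a copy, the input frame is left untouched, unlike the in-place loop in the question.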
