How can I quickly generate quadratic numeric features in a DataFrame in Python?

Question

Using Python and standard libraries, I'd like to quickly generate interaction features for machine learning models (classifiers or regressors). Because feature engineering by hand can be time-consuming, I'm looking for standard Python libraries and methods that can semi-automate some of the process. For example, to generate quadratic features for analysis I have the following code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['abc', 'def', 'ghi', 'kjl'],
                   'b': [2, 5, 7, 8],
                   'c': [1.2, 3, 4, 6]})
num_cols = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]
quadratic_cols = [tuple(sorted((i,j))) for i in num_cols for j in num_cols]
quad_col_pairs = list(set(quadratic_cols))

for col_pair in quad_col_pairs:
    col1, col2 = col_pair
    quadratic_col = '{}*{}'.format(*col_pair)
    df[quadratic_col] = df[col1] * df[col2]

I'd like to simplify this code because this kind of feature engineering should be more standardized and quicker to deploy. It also falls short because it would require more lines of code to generate derived features from addition, subtraction, or division across the feature columns.

How can I simplify the above code? Is there a standard Python method or library that can more efficiently generate derived features for building models?

Answer 1

Try this to build the required column pairs with itertools instead of nested loops:

import itertools

# Numeric columns only; assumes `df` and `np` from the question are in scope
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Unordered pairs, including each column paired with itself (the squares)
quad_col_pairs = list(itertools.combinations_with_replacement(num_cols, 2))

for col1, col2 in quad_col_pairs:
    df['{}*{}'.format(col1, col2)] = df[col1] * df[col2]
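
The question also asks about addition, subtraction, and division. As a rough sketch (not part of the original answer), the same itertools idea can be extended with the operator module. The helper name add_pairwise_features and the OPS table are hypothetical, and the choice of which pairs each operation gets is an assumption: products include squares, sums use unordered pairs, and differences and ratios use ordered pairs because their order matters.

```python
import itertools
import operator

import numpy as np
import pandas as pd

# Hypothetical table: symbol used in the new column name -> (function, pair generator).
OPS = {
    '*': (operator.mul, itertools.combinations_with_replacement),  # includes squares
    '+': (operator.add, itertools.combinations),                   # b+c only, no b+b
    '-': (operator.sub, itertools.permutations),                   # b-c and c-b
    '/': (operator.truediv, itertools.permutations),               # b/c and c/b
}


def add_pairwise_features(df, ops=OPS):
    """Return a copy of df with one new column per numeric column pair and operation."""
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    out = df.copy()
    for symbol, (func, pair_maker) in ops.items():
        for col1, col2 in pair_maker(num_cols, 2):
            # Always read from the original df so derived columns are never re-combined
            out['{}{}{}'.format(col1, symbol, col2)] = func(df[col1], df[col2])
    return out


df = pd.DataFrame({'a': ['abc', 'def', 'ghi', 'kjl'],
                   'b': [2, 5, 7, 8],
                   'c': [1.2, 3, 4, 6]})
features = add_pairwise_features(df)
print(features.columns.tolist())
```

Note that division by a column containing zeros produces inf values in pandas, so ratio features may need cleaning before they are passed to a model.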

Answer 2

Since you explicitly tagged the question with scikit-learn, you can use PolynomialFeatures:

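from sklearn.preprocessing import PolynomialFeatures

# degree defaults to 2; include_bias=False drops the constant column of ones
pf = PolynomialFeatures(include_bias=False)
# select_dtypes is the public equivalent of the private df._get_numeric_data()
pf.fit_transform(df.select_dtypes(include='number'))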

#array([[ 2.  ,  1.2 ,  4.  ,  2.4 ,  1.44],
#       [ 5.  ,  3.  , 25.  , 15.  ,  9.  ],
#       [ 7.  ,  4.  , 49.  , 28.  , 16.  ],
#       [ 8.  ,  6.  , 64.  , 48.  , 36.  ]])

It also lets you use higher-order polynomials (the degree parameter) and include only the interaction terms (interaction_only=True).
