How to quickly generate quadratic numeric features in a dataframe in python?

Using python and standard libraries, I'd like to quickly generate interaction features for machine learning models (classifiers or regressors). Because feature engineering by hand can be time consuming, I'm looking for standard python libraries and methods that can semi-automate some of the process. For example, to generate quadratic features for analysis I have the following code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['abc', 'def', 'ghi', 'kjl'],
                   'b': [2, 5, 7, 8],
                   'c': [1.2, 3, 4, 6]})
num_cols = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]
quadratic_cols = [tuple(sorted((i,j))) for i in num_cols for j in num_cols]
quad_col_pairs = list(set(quadratic_cols))

for col_pair in quad_col_pairs:
    col1, col2 = col_pair
    quadratic_col = '{}*{}'.format(*col_pair)
    df[quadratic_col] = df[col1] * df[col2]

I'd like to simplify this code because this kind of feature engineering should be more standardized and quicker to deploy. It also falls short because it would require more lines of code to generate derived features from addition, subtraction, or division across the feature columns.

How can I simplify the above code? Is there a standard python method or library that can more efficiently generate derived features for building models?

Try this to get the required column pairs without the nested loops:

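import itertools
import numpy as np

# Names of the numeric columns only
L = df.select_dtypes(include=[np.number]).columns.tolist()
# Every unordered pair of columns, including each column paired with itself
quad_col_pairs = list(itertools.combinations_with_replacement(L, 2))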

for col_pair in quad_col_pairs:
    col1, col2 = col_pair
    quadratic_col = '{}*{}'.format(*col_pair)
    df[quadratic_col] = df[col1] * df[col2]
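
The question also asks about features derived from addition, subtraction, or division. A minimal sketch of how the same loop might be extended, assuming a hypothetical `ops` mapping from a symbol to a function in the standard `operator` module (the mapping and the `derived` name are illustrative, not part of the original answer). It uses `itertools.combinations` rather than `combinations_with_replacement`, since a column minus or divided by itself carries no information:

```python
import itertools
import operator

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['abc', 'def', 'ghi', 'kjl'],
                   'b': [2, 5, 7, 8],
                   'c': [1.2, 3, 4, 6]})

# Hypothetical mapping: symbol used in the new column name -> operation
ops = {'*': operator.mul,
       '+': operator.add,
       '-': operator.sub,
       '/': operator.truediv}

num_cols = df.select_dtypes(include=[np.number]).columns.tolist()

# Build all derived columns first, then attach them in one concat
new_cols = {}
for col1, col2 in itertools.combinations(num_cols, 2):
    for symbol, func in ops.items():
        new_cols['{}{}{}'.format(col1, symbol, col2)] = func(df[col1], df[col2])

derived = pd.concat([df, pd.DataFrame(new_cols)], axis=1)
```

Subtraction and division are not symmetric, so this only produces one ordering per pair (for example `b-c` but not `c-b`), and division yields `inf` wherever the denominator is zero.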

Since you explicitly tagged the question with scikit-learn, you can use PolynomialFeatures:

from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(include_bias=False)
# select_dtypes is the public way to keep only the numeric columns
pf.fit_transform(df.select_dtypes(include='number'))

#array([[ 2.  ,  1.2 ,  4.  ,  2.4 ,  1.44],
#       [ 5.  ,  3.  , 25.  , 15.  ,  9.  ],
#       [ 7.  ,  4.  , 49.  , 28.  , 16.  ],
#       [ 8.  ,  6.  , 64.  , 48.  , 36.  ]])

It also gives you options to use higher order polynomials, and to include only the interaction terms.

