How to preprocess with categorical data and dataframes

I am preprocessing data for my multiple linear regression model. I have a list of genres, which I one-hot encode:

genres = [
    "Action",
    "Adventure",
    "Biography",
    "Comedy",
    "Crime",
    "Erotica",
    "Fantasy",
    "Historical fiction",
    "Horror",
    "Mystery",
    "Romance",
    "Satire",
    "Scifi",
    "Speculative",
    "Thriller",
    "Western",
]

I also have a user input, x_user:

x_user = ["Action", "Thriller",]

I want to use x_user as my X_new in:

clf = linear_model.LinearRegression()
clf.fit(X, Y)
clf.predict([X_new])

As I understand it, prediction requires numeric values, so I need to convert x_user into an array X_new of 0/1 flags, one per genre, such as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0].

Is this possible to do with pandas?

I tried:

df = pd.DataFrame(data=genres, columns=["genres"])
df['X_new'] = df.genres.apply(lambda q: q.intercept(x_user)).astype(bool)

But I got an error.

What's the correct way of doing this?

EDIT

My training set looks something like this (after one-hot encoding):

Y     Action  Adventure  ...  Thriller  Western
1.2   1       0          ...  1         1
4.7   0       1          ...  1         0
...   ...     ...        ...  ...       ...

The test set comes from user input and looks something like this (after one-hot encoding):

Action  Thriller
1       1

But I want it to look like this:

Action  Adventure  ...  Thriller  Western
1       0          ...  1         0

Could it be something as simple as:

X_new = pd.DataFrame(0, index=np.arange(len(x_user)), columns=genres)
X_new['Action'] = x_user.Action
X_new['Thriller'] = x_user.Thriller

This assumes x_user is a pandas DataFrame. If it is instead a NumPy array or something similar, assign each column based on its index.
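For the array case, a minimal sketch might look like the following. The names `user_genres` and `x_user_values` are hypothetical and not from the question; the sketch assumes the array has one column per entry of `user_genres`, in the same order, and uses a shortened genre list for readability.

```python
import numpy as np
import pandas as pd

# Shortened genre list for illustration; the real list has 16 entries.
genres = ["Action", "Adventure", "Thriller", "Western"]

# Hypothetical user input: one row, columns correspond to user_genres in order.
user_genres = ["Action", "Thriller"]
x_user_values = np.array([[1, 1]])

# Start from an all-zero frame with one column per known genre.
X_new = pd.DataFrame(0, index=np.arange(len(x_user_values)), columns=genres)

# Assign each user column by its positional index in the array.
for i, genre in enumerate(user_genres):
    X_new[genre] = x_user_values[:, i]

print(X_new)
```

Genres the user did not supply keep their initial value of 0, so the frame has the same columns, in the same order, as the training data.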

If you arrange the columns of x_user so that they are in the correct order, you could generalise to:

X_new[['Action','Thriller']] = x_user

If x_user is a pandas DataFrame:

X_new[x_user.columns] = x_user.values
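If x_user is a plain Python list of genre names, as in the question, one possible approach is to test each known genre for membership with `Series.isin`. The sketch below also shows `DataFrame.reindex` as an alternative for input that is already one-hot encoded; `x_user_df` is a hypothetical name for that frame. Neither is necessarily the only or best way to do this.

```python
import pandas as pd

genres = [
    "Action", "Adventure", "Biography", "Comedy", "Crime", "Erotica",
    "Fantasy", "Historical fiction", "Horror", "Mystery", "Romance",
    "Satire", "Scifi", "Speculative", "Thriller", "Western",
]
x_user = ["Action", "Thriller"]

# Case 1: x_user is a list of names.
# isin() gives True where a genre appears in x_user; cast to 0/1 integers.
X_new = pd.Series(genres).isin(x_user).astype(int).tolist()
print(X_new)

# Case 2: the user input is already a one-hot DataFrame with only some columns
# (x_user_df is a hypothetical name). reindex() adds the missing genre columns,
# fills them with 0, and puts all columns in training order.
x_user_df = pd.DataFrame([[1, 1]], columns=["Action", "Thriller"])
X_new_df = x_user_df.reindex(columns=genres, fill_value=0)
print(X_new_df)
```

Either result can then be passed to the fitted model, for example `clf.predict([X_new])`, provided the column order matches the one used when calling `clf.fit(X, Y)`.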
 
