How to preprocess categorical data and a DataFrame

Question

I am preprocessing data for a multiple linear regression model. I have a list of genres, which I one-hot encode:

genres = [
    "Action",
    "Adventure",
    "Biography",
    "Comedy",
    "Crime",
    "Erotica",
    "Fantasy",
    "Historical fiction",
    "Horror",
    "Mystery",
    "Romance",
    "Satire",
    "Scifi",
    "Speculative",
    "Thriller",
    "Western",
]

I also have a user input, x_user:

x_user = ["Action", "Thriller",]

I want to use x_user as my X_new in:

clf = linear_model.LinearRegression()
clf.fit(X, Y)
clf.predict([X_new])

As I understand it, prediction requires numeric values, so I need to convert x_user into an array X_new of 0/1 flags, such as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0].

Is this possible with pandas?

I tried:

df = pd.DataFrame(data=genres, columns=["genres"])
df['X_new'] = df.genres.apply(lambda q: q.intercept(x_user)).astype(bool)

But I got an error.

What is the correct way to do this?

EDIT

My training set looks something like this (after one-hot encoding):

Y	Action	Adventure	...	Thriller	Western
1.2	1	0	...	1	1
4.7	0	1	...	1	0
...	...	...	...	...	...

The test set comes from user input and looks something like this (after one-hot encoding):

Action	Thriller
1	1

But I want it to look like this:

Action	Adventure	...	Thriller	Western
1	0	...	1	0

Answer 1

Could it be something as simple as:

X_new = pd.DataFrame(0, index=np.arange(len(x_user)), columns=genres)
X_new['Action'] = x_user.Action
X_new['Thriller'] = x_user.Thriller

assuming x_user is a pandas DataFrame. If it is instead a NumPy array or something similar, just assign it based on its index.
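If x_user is the plain Python list shown in the question rather than a DataFrame, the 0/1 vector can probably be built without any column assignment. The following is a minimal sketch under that assumption; it uses `Series.isin` to mark which entries of `genres` appear in `x_user`, and keeps the order of `genres`.

```python
import pandas as pd

genres = [
    "Action", "Adventure", "Biography", "Comedy", "Crime", "Erotica",
    "Fantasy", "Historical fiction", "Horror", "Mystery", "Romance",
    "Satire", "Scifi", "Speculative", "Thriller", "Western",
]

# Assumption: the user input is a plain list of genre names, as in the question.
x_user = ["Action", "Thriller"]

# True where the genre was selected, False otherwise, in the order of `genres`;
# cast to int to get the 0/1 values a regression model expects.
X_new = pd.Series(genres).isin(x_user).astype(int).tolist()

print(X_new)
```

The resulting list has one entry per genre, so it should be usable as `clf.predict([X_new])`, provided the model was trained on columns in the same order as `genres`.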

If you manage the columns of x_user so that they are in the correct order, you could generalise to:

X_new[['Action','Thriller']] = x_user

If x_user is a pandas DataFrame:

X_new[x_user.columns] = x_user.values
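An alternative that avoids creating the zero-filled frame first is `DataFrame.reindex`, which can add the missing genre columns and fill them with 0 in one step. This is only a sketch: the one-row frame `x_user_df` is a hypothetical stand-in for the encoded user input shown in the question's EDIT.

```python
import pandas as pd

genres = [
    "Action", "Adventure", "Biography", "Comedy", "Crime", "Erotica",
    "Fantasy", "Historical fiction", "Horror", "Mystery", "Romance",
    "Satire", "Scifi", "Speculative", "Thriller", "Western",
]

# Hypothetical encoded user input: only the selected genres are present.
x_user_df = pd.DataFrame({"Action": [1], "Thriller": [1]})

# Reorder/extend the columns to match `genres`; genres absent from the
# input are added as new columns filled with 0.
X_new = x_user_df.reindex(columns=genres, fill_value=0)

print(X_new.iloc[0].tolist())
```

Because the columns are taken from `genres`, the frame lines up with the training columns regardless of the order in which the user's genres arrived.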

Solution 1
0 2021-06-28 12:41:49