How to preprocess with categorical data and dataframes
I am preprocessing data for my multiple linear regression model. I have a list of genres, which I one-hot encode:
genres = [
"Action",
"Adventure",
"Biography",
"Comedy",
"Crime",
"Erotica",
"Fantasy",
"Historical fiction",
"Horror",
"Mystery",
"Romance",
"Satire",
"Scifi",
"Speculative",
"Thriller",
"Western",
]
I also have a user input, x_user:
x_user = ["Action", "Thriller",]
I want to use x_user as my X_new in:
clf = linear_model.LinearRegression()
clf.fit(X, Y)
clf.predict([X_new])
As I understand it, predict needs numeric values, so I need to convert x_user into an array X_new of 0/1 flags, one per genre, like [1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0]
Is this possible to do with pandas?
I tried:
df = pd.DataFrame(data=genres, columns=["genres"])
df['X_new'] = df.genres.apply(lambda q: q.intercept(x_user)).astype(bool)
But I got an error.
What's the correct way of doing this?
EDIT
My training set looks something like this (after one-hot encoding):
Y | Action | Adventure | ... | Thriller | Western
---|---|---|---|---|---
1.2 | 1 | 0 | ... | 1 | 1
4.7 | 0 | 1 | ... | 1 | 0
... | ... | ... | ... | ... | ...
The test set comes from user input and looks something like this (after one-hot encoding):
Action | Thriller
---|---
1 | 1
But I want it to look like this:
Action | Adventure | ... | Thriller | Western
---|---|---|---|---
1 | 0 | ... | 1 | 0
Could it be something as simple as:
X_new = pd.DataFrame(0, index=np.arange(len(x_user)), columns=genres)
X_new['Action'] = x_user.Action
X_new['Thriller'] = x_user.Thriller
This assumes x_user is a pandas DataFrame. If it is instead a NumPy array or something similar, just assign the values based on their index.
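In the question, x_user is a plain Python list of genre names rather than a DataFrame. For that case, a minimal sketch of the same idea is to build one row of zeros and flag the chosen genres. The shortened genres list here is only for illustration, and x_vec is a hypothetical name for the flat-vector form:

```python
import pandas as pd

# Shortened genre list, for illustration only; use the full list in practice
genres = ["Action", "Adventure", "Comedy", "Thriller", "Western"]
x_user = ["Action", "Thriller"]

# One row of zeros with one column per known genre
X_new = pd.DataFrame(0, index=[0], columns=genres)

# Flag the genres the user picked (raises KeyError for an unknown genre)
X_new.loc[0, x_user] = 1

# The same information as a flat list, in the order of `genres`
x_vec = pd.Series(genres).isin(x_user).astype(int).tolist()
```

Because X_new is already two-dimensional (one row, one column per genre), it can probably be passed to clf.predict directly, without wrapping it in another list.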
If you keep the columns of x_user in the correct order, you could generalise to:
X_new[['Action','Thriller']] = x_user
If x_user is a pandas DataFrame:
X_new[x_user.columns] = x_user.values
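Another option, not part of the approach above, is DataFrame.reindex. It puts the columns in the training order and fills the missing genres with 0 in a single step. This is a sketch that assumes x_user is a one-hot DataFrame holding only the genres the user picked; the shortened genres list is illustrative:

```python
import pandas as pd

# Shortened genre list, for illustration only; must match the training columns
genres = ["Action", "Adventure", "Thriller", "Western"]

# User input after one-hot encoding: only the selected genres, in any order
x_user = pd.DataFrame([[1, 1]], columns=["Thriller", "Action"])

# Align to the training layout: reorder columns and fill absent genres with 0
X_new = x_user.reindex(columns=genres, fill_value=0)
```

Note that reindex silently drops any column of x_user that is not in genres, so it may be worth validating the user input beforehand.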