Multiple linear regression with categorical features using sklearn - python

I have a dataset in which each document has a corresponding score/rating:

dataset = [
   {"text":"I don't like this small device", "rating":"2"},
   {"text":"Really love this large device", "rating":"5"},
   ....
]

In addition, I have categories (variables) made up of term lists extracted from the text fields of the same dataset:

x1 = ["short", "slim", "small", "shrink"]
x2 = ["big", "huge", "large"]

So, how can I run a linear regression with the word lists above as multiple independent variables (or rather, with variables representing the presence of any word from the corresponding term list, since each term in the lists is unique) and the rating as the dependent variable? In other words:

How can I evaluate the impact of the term lists on the rating with sklearn?

I used TfidfVectorizer to derive the document-term matrix. If possible, please provide a simple code snippet or example.

Given the discussion in the comments, it seems that the interpretation should be that each list defines a binary variable whose value depends on whether or not any word from the list appears in the text in question. So, let us first change the texts so that the words actually appear:

dataset = [
   {"text": "I don't like this large device", "rating": "2"},
   {"text": "Really love this small device", "rating": "5"},
   {"text": "Some other text", "rating": "3"}
]

To simplify our work, we'll then load this data into a data frame, convert the ratings to integers, and create the relevant variables:

import pandas as pd

df = pd.DataFrame(dataset)
df['rating'] = df['rating'].astype(int)  # ratings are stored as strings
df['text'] = df['text'].str.split().apply(set)  # set of whitespace-separated tokens
x1 = ['short', 'slim', 'small', 'shrink']
x2 = ['big', 'huge', 'large']
# True if the text shares at least one word with the list (an empty intersection is falsy)
df['x1'] = df.text.apply(lambda x: x.intersection(x1)).astype(bool)
df['x2'] = df.text.apply(lambda x: x.intersection(x2)).astype(bool)

That is, at this point df is the following data frame:

   rating                                   text     x1     x2
0       2  {this, large, don't, like, device, I}  False   True
1       5    {this, small, love, Really, device}   True  False
2       3                    {other, Some, text}  False  False

With this, we can create the relevant model and check what the coefficients end up being:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['x1', 'x2']], df.rating)
print(model.coef_)  # approximately [ 2. -1.]
print(model.intercept_)  # approximately 3.0
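Since the question mentions deriving a document-term matrix with a vectorizer, the same two indicator variables can also be built that way. The following is only a sketch under assumptions: it swaps in CountVectorizer with a fixed vocabulary and binary=True in place of TfidfVectorizer, and the names texts, ratings, doc_term, has_x1, has_x2 and features are illustrative. Note that CountVectorizer lowercases and tokenizes differently from str.split(), so results can differ on other texts.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression

texts = [
    "I don't like this large device",
    "Really love this small device",
    "Some other text",
]
ratings = np.array([2, 5, 3])

x1 = ["short", "slim", "small", "shrink"]
x2 = ["big", "huge", "large"]

# Restrict the vocabulary to the listed terms; binary=True records presence only.
# Columns follow the order of the vocabulary list: x1 terms first, then x2 terms.
vectorizer = CountVectorizer(vocabulary=x1 + x2, binary=True)
doc_term = vectorizer.transform(texts)  # shape: (3 documents, 7 terms)

# Collapse each list's columns into one indicator: True if any term occurs.
n1 = len(x1)
has_x1 = np.asarray(doc_term[:, :n1].sum(axis=1)).ravel() > 0
has_x2 = np.asarray(doc_term[:, n1:].sum(axis=1)).ravel() > 0
features = np.column_stack([has_x1, has_x2]).astype(float)

model = LinearRegression().fit(features, ratings)
print(model.coef_, model.intercept_)
```

On this three-document example the indicators match the x1 and x2 columns of the data frame, so the fitted coefficients should agree with the ones reported in the answer.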

As also mentioned in the comments, this model can produce at most four distinct ratings, one for each combination of x1 and x2 being True or False. In this case, it just so happens that all possible outputs are integers, but in general they need not be, nor need they be confined to the interval of interest. Given the ordinal nature of the ratings, this is really a case for some sort of ordinal regression (cf., e.g., mord).
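To make the "at most four ratings" point concrete, here is a small sketch that enumerates the prediction for every combination of the two indicators. It assumes the intercept and coefficients reported by the fit in this answer (3, 2 and -1) and evaluates the linear formula directly rather than calling the model; the name predictions is illustrative.

```python
from itertools import product

# Values reported by the fitted model in this answer.
intercept = 3.0
coef_x1, coef_x2 = 2.0, -1.0

# A linear model on two binary features has only four possible inputs,
# hence at most four distinct predicted ratings.
predictions = {
    (x1, x2): intercept + coef_x1 * x1 + coef_x2 * x2
    for x1, x2 in product([False, True], repeat=2)
}

for (x1, x2), rating in predictions.items():
    print(f"x1={x1!s:5} x2={x2!s:5} -> predicted rating {rating}")
```

The combination where both indicators are True yields 4.0 even though no such document appears in the training data; that value is simply the sum of the two effects, which is what a linear model assumes.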
