Multiple Output Machine Learning Model - Python

Hello everyone, I've tried searching for this topic and haven't been able to find a good answer, so I was hoping someone could help me out. Let's say I am trying to create an ML model using scikit-learn and Python. I have a data set like this:

| Features | Topic   | Sub-Topic        |
|----------|---------|------------------|
| ...      | Science | Space            |
| ...      | Science | Engineering      |
| ...      | History | American History |
| ...      | History | European History |

My only feature is text, such as a short paragraph from an essay. I want to use ML to predict the topic and sub-topic of that text.

I know I would need some sort of NLP library, such as spaCy, to analyze the text. The part that confuses me is having two output variables: topic and sub-topic. I've read that scikit-learn has something called MultiOutputClassifier, but there is also something called multiclass classification, so I'm not sure which route to take.

Could someone please point me in the right direction as to which estimator to use or how to achieve this?

MultiClass just means that a single target variable has multiple classes. MultiOutput means there is more than one target variable. Here both apply, so this is a MultiClass-MultiOutput problem.
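To make the distinction concrete, here is a small sketch using scikit-learn's `type_of_target` helper, which reports how scikit-learn interprets a target array. The integer label codes are made up for illustration (for instance, column 0 standing for topic and column 1 for sub-topic); they are not taken from the question's data.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical integer codes for a single target with three classes.
y_single = np.array([0, 1, 2, 1])

# Hypothetical codes for two targets per sample:
# column 0 = topic, column 1 = sub-topic.
y_two_targets = np.array([[0, 0], [0, 1], [1, 2], [1, 3]])

kind_single = type_of_target(y_single)
kind_two = type_of_target(y_two_targets)
print(kind_single)  # multiclass
print(kind_two)     # multiclass-multioutput
```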

scikit-learn supports MultiClass-MultiOutput natively for the following classifiers:

sklearn.tree.DecisionTreeClassifier
sklearn.tree.ExtraTreeClassifier
sklearn.ensemble.ExtraTreesClassifier
sklearn.neighbors.KNeighborsClassifier
sklearn.neighbors.RadiusNeighborsClassifier
sklearn.ensemble.RandomForestClassifier
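
For estimators that are not on this list, the MultiOutputClassifier mentioned in the question can wrap a single-output classifier and fit one copy of it per target column. The following is only a sketch: the choice of LogisticRegression and the synthetic data are assumptions made to exercise the API, not part of the original problem.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data, invented purely to exercise the API.
rng = np.random.RandomState(0)
X = rng.randn(30, 3)
y_a = (X[:, 0] > 0).astype(int)          # first target: 2 classes
y_b = np.digitize(X[:, 1], [-0.5, 0.5])  # second target: up to 3 classes
Y = np.column_stack([y_a, y_b])

# LogisticRegression handles one target at a time; the wrapper fits
# one independent copy of it per column of Y.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
pred = clf.predict(X)
print(pred.shape)            # one prediction column per target
print(len(clf.estimators_))  # one fitted estimator per target
```

Because each column gets its own independent model, this approach does not share any information between the targets.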

I'd suggest starting with RandomForestClassifier, since it usually gives good results out of the box.

Here is a dummy example that demonstrates the RandomForestClassifier API for multiple targets:

# Dummy example only to test functionality
import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(0)
X = np.random.randn(10, 2)
y1 = (X[:, [0]] > .5).astype(int)  # make dummy y1
y2 = (X[:, [1]] < .5).astype(int)  # make dummy y2
y = np.hstack([y1, y2])  # y has 2 columns
print("X = ", X, sep="\n", end="\n\n")
print("y = ", y, sep="\n", end="\n\n")
rfc = RandomForestClassifier().fit(X, y)  # use the same API for multi-column y!
out = rfc.predict(X)
print("Output = ", out, sep="\n")

Output:

X = 
[[ 1.76405235  0.40015721]
 [ 0.97873798  2.2408932 ]
 [ 1.86755799 -0.97727788]
 [ 0.95008842 -0.15135721]
 [-0.10321885  0.4105985 ]
 [ 0.14404357  1.45427351]
 [ 0.76103773  0.12167502]
 [ 0.44386323  0.33367433]
 [ 1.49407907 -0.20515826]
 [ 0.3130677  -0.85409574]]

y = 
[[1 1]
 [1 0]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 1]]

Output = 
[[1 1]
 [1 0]
 [1 1]
 [1 1]
 [0 1]
 [0 0]
 [1 1]
 [0 1]
 [1 1]
 [0 1]]
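
The same API carries over to the text problem in the question once the paragraphs are turned into numeric features. Below is a minimal sketch under stated assumptions: TfidfVectorizer is swapped in as a simple feature extractor (the question mentions spaCy; this is not that), and the four sentences are an invented toy corpus, far too small to train a useful model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, invented for illustration only.
texts = [
    "The telescope observed a distant galaxy and its stars.",
    "The bridge design uses steel beams to carry the load.",
    "The colonies declared independence from Britain in 1776.",
    "The French Revolution reshaped politics across Europe.",
]
# One row per text: [topic, sub-topic].
labels = [
    ["Science", "Space"],
    ["Science", "Engineering"],
    ["History", "American History"],
    ["History", "European History"],
]

# Vectorize the text, then fit a forest on the two-column target.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
model.fit(texts, labels)

pred = model.predict(["A new galaxy was seen through the telescope."])
print(pred.shape)  # one row, two columns: (topic, sub-topic)
```

Note that the forest chooses each output column separately, so nothing here guarantees that the predicted sub-topic belongs to the predicted topic.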

On a side note, since you are building an NLP model, I'd suggest looking at Keras's multi-output NN API to train a neural network, which may give better results.
