How do I properly combine numerical features with text (bag of words) in Spark?
My question is similar to this one, but for Spark, and the original question does not have a satisfactory answer.
I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's text (already pre-processed) as hashed TF-IDF, together with its month, as follows:
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")
If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)
VectorAssembler simply combines all the input columns into a single vector; it does nothing with weights or anything else.
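To make the concatenation concrete, here is a minimal sketch on a single made-up row. The local SparkSession, the 5-slot sparse vector standing in for hash-tfidf, and the sample values are all assumptions for illustration; only the column names come from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("assembler-layout")
  .getOrCreate()
import spark.implicits._

// One made-up row: month index 3.0 and a 5-slot sparse stand-in for hash-tfidf.
val df = Seq(
  (3.0, Vectors.sparse(5, Array(1, 4), Array(0.7, 1.2)))
).toDF("month-idx", "hash-tfidf")

val assembled = new VectorAssembler()
  .setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")
  .transform(df)

// Slot 0 holds month-idx; the remaining slots hold the tf-idf values, unchanged.
val features = assembled.select("features").head.getAs[Vector](0)
println(features.toArray.mkString(", "))
```

With the settings from the question, the text block would occupy slots 1 to 30000 of the assembled vector and the month index would sit in slot 0.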
Since the 30,000-word vector is very sparse, it is very likely that the denser features (the months) will have a greater impact on the result, so these features would likely not get "swamped", as you put it. You can train a model and check the weights of the features to confirm this. Simply use the coefficients method provided by LinearSVCModel to see how much each feature influences the final sum:
val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients
Features whose coefficients are larger in absolute value have a greater influence on the final result.
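As a rough way to compare the month against the text block, and to list the strongest text features, the coefficient vector can be split by position. This is a sketch that assumes the layout produced by the question's VectorAssembler (month-idx in slot 0, hash-tfidf in the remaining slots); the names BlockSummary, summarise and topTextFeatures are hypothetical helpers, not part of Spark.

```scala
// Hypothetical summary of a coefficient array laid out as
// [month-idx, hash-tfidf(0), ..., hash-tfidf(n-1)].
final case class BlockSummary(
  monthAbs: Double,    // |coefficient| of the single month-idx slot
  textMaxAbs: Double,  // largest |coefficient| in the text block
  textMeanAbs: Double, // mean |coefficient| over the text block
  textNonZero: Int     // number of text slots with a non-zero coefficient
)

def summarise(coeffs: Array[Double]): BlockSummary = {
  require(coeffs.length >= 2, "expected one month slot plus at least one text slot")
  val textAbs = coeffs.tail.map(w => math.abs(w))
  BlockSummary(
    monthAbs = math.abs(coeffs.head),
    textMaxAbs = textAbs.max,
    textMeanAbs = textAbs.sum / textAbs.length,
    textNonZero = textAbs.count(_ > 0.0)
  )
}

// The k text slots with the largest |coefficient|, as (index within hash-tfidf, coefficient).
def topTextFeatures(coeffs: Array[Double], k: Int): List[(Int, Double)] =
  coeffs.tail.zipWithIndex
    .sortBy { case (w, _) => -math.abs(w) }
    .take(k)
    .map { case (w, i) => (i, w) }
    .toList
```

With a fitted model this could be called as summarise(model.coefficients.toArray). Because HashingTF maps several words to the same slot, an index returned by topTextFeatures identifies a hash bucket rather than a unique word.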
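If the weight given to the month turns out to be too low or too high, note that the setWeightCol() method will not fix it: it names a column of per-row (instance) weights and does not weight individual features. To change how much the month contributes, change the feature itself before it reaches VectorAssembler, for example by rescaling or re-encoding it, and keep in mind that LinearSVC standardizes features by default (see setStandardization()), which affects how a rescaled column is treated.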