How do I correctly combine numeric features with text (bag of words) in Spark?

Question

My question is similar to this one, but for Spark, and the original question does not have a satisfactory answer.

I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's pre-processed text as hash-tfidf, plus its month, as follows:

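import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, VectorAssembler}

// "text" must hold an array of tokens; they are hashed into 30,000 buckets
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
// slot 0 of "features" is the month index, slots 1..30000 are the tf-idf values
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")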

If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)

Answer 1

VectorAssembler simply concatenates its input columns into a single vector; it does not apply weights or any other adjustment.
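To make the concatenation concrete, here is a minimal sketch. It assumes a local SparkSession and a toy row whose 5-slot sparse vector stands in for the real 30,000-slot hash-tfidf column; the values are invented for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("assembler-demo")
  .getOrCreate()
import spark.implicits._

// One toy row: a month index plus a sparse 5-slot stand-in for "hash-tfidf".
val df = Seq(
  (3.0, Vectors.sparse(5, Array(1, 4), Array(0.7, 1.2)))
).toDF("month-idx", "hash-tfidf")

val assembled = new VectorAssembler()
  .setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")
  .transform(df)

val features = assembled.select("features").head.getAs[Vector](0)

// Slot 0 is the month index; the remaining slots are the text vector, unchanged.
println(s"size = ${features.size}")
println(features.toArray.mkString(", "))
```

The assembled vector has one slot for the month followed by the text slots in their original order, so the month's coefficient in a fitted model sits at index 0.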

Since the 30,000-word vector is very sparse, it is likely that the denser feature (the month) will have a greater impact on the result, so it probably won't get "swamped", as you put it. You can train a model and check the feature weights to confirm this. Use the coefficients method of LinearSVCModel to see how much each feature contributes to the final sum:

import org.apache.spark.ml.classification.LinearSVC

val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients  // one coefficient per slot of "features"

Features whose coefficients are larger in absolute value have more influence on the final result.
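As a rough sketch of how the coefficients could be inspected, the snippet below separates the month coefficient from the text coefficients and ranks the text buckets by magnitude. The vector here is a made-up stand-in for model.coefficients, and it assumes the month was the first input column to VectorAssembler, as in the question.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical stand-in for `model.coefficients`; values are invented.
// Slot 0 corresponds to "month-idx", the rest to the hashed text buckets.
val coeffs: Vector = Vectors.dense(0.05, -1.3, 0.0, 0.4, 2.1, -0.2)

val monthCoeff = coeffs(0)
val textCoeffs = coeffs.toArray.drop(1)

// Rank text buckets by absolute coefficient, keeping each bucket's index.
val topBuckets = textCoeffs.zipWithIndex
  .sortBy { case (w, _) => -math.abs(w) }
  .take(3)

println(s"month coefficient: $monthCoeff")
topBuckets.foreach { case (w, i) => println(s"bucket $i: $w") }
```

Because HashingTF maps words to buckets by hash, a bucket index cannot be turned back into a single word with certainty. If readable "best features" matter, CountVectorizer may be a better fit, since its fitted model exposes a vocabulary array indexed the same way as the output vector.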

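If the weight given to the month turns out to be too low or too high, note that setWeightCol() will not help directly: it names a column of per-row (instance) weights, not per-feature weights. Changing a feature's influence means changing how that feature is encoded or scaled before it reaches VectorAssembler.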

Solution 1
1 Accepted 2018-01-08 07:01:26
