How do I properly combine numerical features with text (bag of words) in Spark?

My question is similar to this one, but for Spark, and the original question does not have a satisfactory answer.

I am using a Spark 2.2 LinearSVC model with tweet data as input: a tweet's pre-processed text as hashed TF-IDF (hash-tfidf), plus its month, as follows:

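import org.apache.spark.ml.feature.{HashingTF, IDF, StringIndexer, VectorAssembler}

// Hash the pre-processed text into 30,000 term-frequency buckets.
val hashingTF = new HashingTF().setInputCol("text").setOutputCol("hash-tf")
  .setNumFeatures(30000)
// Reweight the term frequencies by inverse document frequency.
val idf = new IDF().setInputCol("hash-tf").setOutputCol("hash-tfidf")
  .setMinDocFreq(10)
// Map each month value to a numeric index.
val monthIndexer = new StringIndexer().setInputCol("month")
  .setOutputCol("month-idx")
// Concatenate the month index and the TF-IDF vector into one feature vector.
val va = new VectorAssembler().setInputCols(Array("month-idx", "hash-tfidf"))
  .setOutputCol("features")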

If there are 30,000 word features, won't these swamp the month? Or is VectorAssembler smart enough to handle this? (And, if possible, how do I get the best features of this model?)

VectorAssembler simply combines all the input columns into a single vector; it does nothing with weights or anything else.
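To make the concatenation concrete, here is a minimal plain-Scala sketch (no Spark dependency) of the layout the assembled vector would have for the columns in the question: slot 0 holds month-idx, and the following 30,000 slots hold the hash-tfidf entries. The `AssemblerSketch` object, its `Sparse` case class, and the `assemble` helper are hypothetical names used only for illustration; they are not part of the Spark API.

```scala
// Illustrative only: mimics how the input columns are laid out side by side.
// `AssemblerSketch`, `Sparse` and `assemble` are hypothetical, not Spark API.
object AssemblerSketch {
  // A sparse vector as (size, index -> value).
  final case class Sparse(size: Int, values: Map[Int, Double])

  // Concatenate one scalar column and one sparse vector column, in that order.
  def assemble(monthIdx: Double, tfidf: Sparse): Sparse = {
    // The scalar takes slot 0, so every text index shifts right by one.
    val shifted = tfidf.values.map { case (i, v) => (i + 1, v) }
    // Zero values are simply not stored in a sparse layout.
    val head = if (monthIdx != 0.0) Map(0 -> monthIdx) else Map.empty[Int, Double]
    Sparse(tfidf.size + 1, head ++ shifted)
  }

  def main(args: Array[String]): Unit = {
    // Made-up TF-IDF values for two hash buckets.
    val tfidf = Sparse(30000, Map(17 -> 2.3, 4021 -> 0.7))
    val features = assemble(5.0, tfidf)
    println(features.size)
    println(features.values.toSeq.sortBy(_._1))
  }
}
```

No value is rescaled or reweighted along the way, which is why the relative influence of the month is decided by the model that is trained on the assembled vector, not by the assembler.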

Since the 30,000-word vector is very sparse, the denser feature (the month) is likely to have a greater impact on the result, so it would probably not get "swamped", as you put it. You can train a model and check the feature weights to confirm this. Use the coefficients method of LinearSVCModel to see how much each feature influences the final sum:

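import org.apache.spark.ml.classification.LinearSVC

val model = new LinearSVC().fit(trainingData)
val coeffs = model.coefficients  // one weight per entry of the assembled feature vector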

Features whose coefficients have a larger absolute value have more influence on the final result.
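One way to read the coefficients, sketched in plain Scala, is to sort them by absolute value and map each index back to its source column. This assumes the coefficient array follows the VectorAssembler input order from the question (index 0 is month-idx, the rest are hash-tfidf buckets). `CoefficientRanking`, `label` and `topByMagnitude` are hypothetical helper names, and the numbers are stand-ins rather than real model output; with a fitted model the array would come from model.coefficients.toArray.

```scala
// Illustrative only: ranks a coefficient array by magnitude.
// `CoefficientRanking`, `label` and `topByMagnitude` are hypothetical helpers.
object CoefficientRanking {
  // Describe which input column an assembled-feature index belongs to,
  // assuming the input order Array("month-idx", "hash-tfidf").
  def label(index: Int): String =
    if (index == 0) "month-idx" else s"hash-tfidf[${index - 1}]"

  // Return the k coefficients with the largest absolute value, with labels.
  def topByMagnitude(coeffs: Array[Double], k: Int): Seq[(String, Double)] =
    coeffs.zipWithIndex
      .sortBy { case (w, _) => -math.abs(w) }
      .take(k)
      .map { case (w, i) => (label(i), w) }
      .toSeq

  def main(args: Array[String]): Unit = {
    // Stand-in values; with Spark, use model.coefficients.toArray instead.
    val coeffs = Array(0.05, -1.2, 0.0, 0.8)
    topByMagnitude(coeffs, 3).foreach { case (name, w) =>
      println(f"$name%-16s $w%+.3f")
    }
  }
}
```

Because HashingTF maps words to buckets by hashing, a high-ranking hash-tfidf index may correspond to several different words, so this identifies influential buckets rather than individual words.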

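If the weight given to the month turns out to be too low or too high, you can change its influence by rescaling or re-encoding that feature before assembling it. Note that the setWeightCol() method sets per-row (instance) weights rather than per-feature weights, so it will not rebalance the month against the text features.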
