

Attempting to cluster documents with TF-IDF and KMeans in Spark. What's wrong with this piece of code?

I have a CSV file with a text field, in 2 languages (French and English). I'm attempting to perform a cluster analysis and somewhat expecting the texts to be grouped in 2 clusters due to the language difference.

I came up with the following piece of code, which doesn't work as intended:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.KMeans

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
    StructField("id_suivi", StringType, true),
    StructField("id_ticket", StringType, true),
    StructField("id_affectation", StringType, true),
    StructField("id_contact", StringType, true),
    StructField("d_date", StringType, true),
    StructField("n_duree_passe", StringType, true),
    StructField("isPublic", StringType, true),
    StructField("Ticket_Request_Id", StringType, true),
    StructField("IsDoneInHNO", StringType, true),
    StructField("commments", StringType, true),
    StructField("reponse", StringType, true)))

val tokenizer = new Tokenizer().setInputCol("reponse").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(32768)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val df = sqlContext.read.format("com.databricks.spark.csv").
    option("header", "true").
    option("delimiter", ";").
    schema(customSchema).
    load("C:/noSave/tmp/22/tickets1.csv").
    select("id_suivi", "reponse")

val tokenizedDF = tokenizer.transform(df)
val hashedDF = hashingTF.transform(tokenizedDF).cache()

val idfModel = idf.fit(hashedDF)

val rescaledDF = idfModel.transform(hashedDF).cache()

val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(rescaledDF)

val clusteredDF = model.transform(rescaledDF)

I believe this code is correct, or at least I don't see where the bug is. However, something is really wrong, because when I compute the error, it's really big:

scala> model.computeCost(rescaledDF)
res0: Double = 3.1555983509935196E7

I have also tried different values for K (I thought 2 was a good value, since my texts are in 2 languages, French and English), such as 10, 100 or even bigger, looking for the "elbow" value, but no luck.
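For reference, the kind of K sweep I was running looks roughly like the following (a minimal sketch that reuses rescaledDF from the code above; the list of K values is arbitrary):

// Elbow-method sweep: fit KMeans for several values of K and print the
// within-cluster cost for each, looking for the K after which the cost
// stops dropping sharply.
val costs = Seq(2, 5, 10, 20, 50, 100).map { k =>
  val km = new KMeans().setK(k).setSeed(1L).setFeaturesCol("features")
  val m = km.fit(rescaledDF)
  (k, m.computeCost(rescaledDF))
}
costs.foreach { case (k, cost) => println(s"k=$k cost=$cost") }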

Can anyone point me in the right direction?

Many thanks in advance!

I'll answer my own question (hopefully this is acceptable by SO's etiquette) in case it is of any use to someone else one day.

An easier way to differentiate the 2 languages is to consider their use of stop words (i.e. words that are "commonly common" in each language).

Using TF-IDF was a bad idea to start with, because it nullifies the contribution of the stop words (its purpose is to put the focus on the "uncommonly common" terms in a document).

I managed to get closer to my goal of clustering by language by using CountVectorizer, which builds a dictionary of the most frequently used terms and counts them for each document.

The most common terms being the stop words, we end up clustering the documents by their use of stop words, and since those form different sets in the two languages, we effectively end up clustering by language.
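For completeness, here is a minimal sketch of the CountVectorizer-based pipeline described above (it reuses the df DataFrame loaded earlier with the "reponse" column; the vocabSize and minDF values are illustrative, not tuned):

import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.KMeans

// Tokenize the free-text column, then keep raw term counts instead of TF-IDF
// weights, so the very frequent stop words keep their full contribution.
val tokenizer = new Tokenizer().setInputCol("reponse").setOutputCol("words")
val countVectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(1000)  // keep only the most frequent terms (mostly stop words)
  .setMinDF(2.0)       // ignore terms that appear in a single document

val tokenizedDF = tokenizer.transform(df)
val cvModel = countVectorizer.fit(tokenizedDF)
val countedDF = cvModel.transform(tokenizedDF).cache()

// Cluster on the raw counts; with 2 languages, K = 2 is the natural starting point.
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(countedDF)
val clusteredDF = model.transform(countedDF)

Limiting the vocabulary size keeps the focus on the most frequent terms, which is exactly where the two languages differ the most.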
