
Attempting to cluster documents with TF-IDF and KMeans in Spark. What's wrong with this piece of code?

I have a CSV file with a text field in two languages (French and English). I'm attempting a cluster analysis, and I somewhat expect the texts to be grouped into two clusters due to the language difference.

I came up with the following piece of code, which doesn't work as intended:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.ml.clustering.KMeans

val sqlContext = new SQLContext(sc)

val customSchema = StructType(Array(
    StructField("id_suivi", StringType, true),
    StructField("id_ticket", StringType, true),
    StructField("id_affectation", StringType, true),
    StructField("id_contact", StringType, true),
    StructField("d_date", StringType, true),
    StructField("n_duree_passe", StringType, true),
    StructField("isPublic", StringType, true),
    StructField("Ticket_Request_Id", StringType, true),
    StructField("IsDoneInHNO", StringType, true),
    StructField("commments", StringType, true),
    StructField("reponse", StringType, true)))

val tokenizer = new Tokenizer().setInputCol("reponse").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(32768)
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

val df = sqlContext.read.format("com.databricks.spark.csv").
    option("header", "true").
    option("delimiter", ";").
    schema(customSchema).
    load("C:/noSave/tmp/22/tickets1.csv").
    select("id_suivi", "reponse")

val tokenizedDF = tokenizer.transform(df)
val hashedDF = hashingTF.transform(tokenizedDF).cache()

val idfModel = idf.fit(hashedDF)

val rescaledDF = idfModel.transform(hashedDF).cache()

val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(rescaledDF)

val clusteredDF = model.transform(rescaledDF)

I believe this code is correct, or at least I don't see where the bug is. However, something is clearly wrong, because when I compute the cost it is huge:

scala> model.computeCost(rescaledDF)
res0: Double = 3.1555983509935196E7

I have also tried other values for K (I thought 2 was a good value since my texts are in two languages, French and English), such as 10, 100 or even bigger, looking for the "elbow" value, but no luck.

Can anyone point me in the right direction?

Many thanks in advance!

I'll answer my own question (hopefully this is acceptable under SO etiquette) in case it is of any use to someone else one day.

An easier way to differentiate the two languages is to consider their use of stop words (i.e. words that are very common in each language).

Using TF-IDF was a bad idea to start with, because it nullifies the contribution of the stop words (its very purpose is to put the focus on the "uncommonly common" terms in a document).
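To see why, here is a toy, plain-Scala sketch of the smoothed IDF formula that Spark ML uses, `idf(t) = log((N + 1) / (df(t) + 1))`. A stop word that appears in (almost) every document gets a weight of (nearly) zero, so it barely contributes to the feature vectors. The document counts below are made up for illustration:

```scala
// Toy illustration (no Spark): Spark ML's smoothed IDF weighting.
object IdfSketch {
  // idf(t) = log((numDocs + 1) / (docFreq + 1)), as in Spark's IDF
  def idf(numDocs: Int, docFreq: Int): Double =
    math.log((numDocs + 1.0) / (docFreq + 1.0))

  def main(args: Array[String]): Unit = {
    val numDocs = 1000
    // A stop word like "the" or "le" appears in every document -> IDF is 0,
    // so TF-IDF erases exactly the signal that distinguishes the languages.
    val stopWordIdf = idf(numDocs, docFreq = 1000)
    // A rare domain term appears in only a few documents -> large IDF.
    val rareTermIdf = idf(numDocs, docFreq = 5)
    println(f"stop word IDF: $stopWordIdf%.4f, rare term IDF: $rareTermIdf%.4f")
  }
}
```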

I managed to get closer to my goal of clustering by language by using CountVectorizer, which builds a vocabulary of the most frequently used terms and counts them for each document.

Since the most common terms are the stop words, we end up clustering the documents by their use of stop words, and because the two languages use different sets of stop words, this effectively clusters by language.
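For reference, a minimal sketch of the CountVectorizer variant of the pipeline, reusing the `df` and column names from the question. The `vocabSize` and `minDF` values are illustrative, not tuned:

```scala
import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer}
import org.apache.spark.ml.clustering.KMeans

val tokenizer = new Tokenizer().setInputCol("reponse").setOutputCol("words")

// CountVectorizer keeps the vocabSize most frequent terms across the corpus;
// with a small vocabulary, the stop words dominate, which is what we want here.
val countVectorizer = new CountVectorizer().
    setInputCol("words").
    setOutputCol("features").
    setVocabSize(600).   // illustrative value
    setMinDF(2)          // ignore terms appearing in fewer than 2 documents

val tokenizedDF = tokenizer.transform(df)   // df loaded as in the question
val cvModel = countVectorizer.fit(tokenizedDF)
val countedDF = cvModel.transform(tokenizedDF).cache()

val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(countedDF)
val clusteredDF = model.transform(countedDF)
```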
