
StringIndexer in Spark MLlib

I have a column of nominal values in my PipelinedRDD that I would like to convert to index encoding for classification purposes.

I used to use the StringIndexer in pyspark.ml, which was extremely easy to use. However, this time I am learning how to work with an RDD instead of a DataFrame, and there is no such thing in pyspark.mllib.

Any help is appreciated.

There is no StringIndexer in Spark MLlib, so you need to do the work yourself. Start by collecting all possible values for that column and assigning each one a number; save this as a dictionary. Afterwards, apply it to the values of the original RDD.

The code below assumes that PipelinedRDD contains two values in each row, with the value to convert in the first position (index 0):

# Map each distinct value to a unique index and collect the result as a dict
dic = PipelinedRDD.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()
# Replace the nominal value with its index, keeping the second field unchanged
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
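
To see this in action, here is a minimal, self-contained sketch; the sample data and the SparkContext setup are illustrative assumptions, not part of the original question:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (category, label) rows standing in for the PipelinedRDD
rdd = sc.parallelize([("red", 1.0), ("blue", 0.0), ("red", 1.0), ("green", 0.0)])

# Build the value -> index dictionary from the distinct categories
dic = rdd.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()

# Apply it to encode the first field
indexed = rdd.map(lambda x: (dic[x[0]], x[1]))
print(indexed.collect())  # e.g. [(0, 1.0), (1, 0.0), (0, 1.0), (2, 0.0)]

Note that the indices assigned by distinct().zipWithIndex() depend on partition order, so the exact mapping may vary between runs.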

Note: This differs slightly from Spark's implementation of StringIndexer, since it does not take the frequency of the values into account (Spark assigns 0 to the most frequent value, then 1, and so on). However, in most cases, which index a given string is assigned is of no concern.


Extension: If you want to mimic exactly what StringIndexer does, as mentioned in the note above, the code can be modified to count value frequencies on the RDD and sort by them before indexing:

dic = (PipelinedRDD.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
                   .sortBy(lambda kv: kv[1], ascending=False)  # most frequent first
                   .keys().zipWithIndex().collectAsMap())
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
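
With the hypothetical sample from the sketch above, "red" appears twice and would therefore receive index 0, matching StringIndexer's frequency ordering:

# Frequency-ordered indexing on the hypothetical sample rdd from above
dic = (rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
          .sortBy(lambda kv: kv[1], ascending=False)
          .keys().zipWithIndex().collectAsMap())
print(dic)  # {'red': 0, 'blue': 1, 'green': 2} (ties may come out in either order)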
