
StringIndexer in Spark MLlib

I have a column of nominal values in my PipelinedRDD that I would like to convert to index encoding for classification purposes.

I used to use the StringIndexer in pyspark.ml, which was extremely easy to use. However, this time I am learning how to work with an RDD instead of a DataFrame, and there is no such thing in pyspark.mllib.

Any help is appreciated.

There is no StringIndexer in Spark MLlib, so you need to do the work yourself. Start by collecting all possible values for that column and assigning each one a number; save this as a dictionary. Afterwards, apply it to the values of the original RDD.

The code below assumes that PipelinedRDD contains two values in each row, with the value to convert in the first position (index 0):

# Map each distinct value to a unique index and collect the result as a dict
dic = PipelinedRDD.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()
# Replace the nominal value with its index, keeping the second field unchanged
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
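
To see this in action, here is a minimal, self-contained sketch; the sample data and the SparkContext setup are illustrative assumptions, not part of the original question:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Hypothetical (category, label) rows standing in for the PipelinedRDD
rdd = sc.parallelize([("red", 1.0), ("blue", 0.0), ("red", 1.0), ("green", 0.0)])

# Build the value -> index dictionary from the distinct categories
dic = rdd.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()

# Apply it to encode the first field
indexed = rdd.map(lambda x: (dic[x[0]], x[1]))
print(indexed.collect())  # e.g. [(0, 1.0), (1, 0.0), (0, 1.0), (2, 0.0)]

Note that the indices assigned by distinct().zipWithIndex() depend on partition order, so the exact mapping may vary between runs.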

Note: This differs slightly from Spark's implementation of StringIndexer, since it does not take the frequency of the values into account (Spark assigns 0 to the most frequent value, then 1, and so on). However, in most cases, which index a given string is assigned is of no concern.


Extension: If you want to mimic exactly what StringIndexer does, as mentioned in the note above, the code can be modified to count value frequencies on the RDD and sort by them before indexing:

dic = (PipelinedRDD.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
                   .sortBy(lambda kv: kv[1], ascending=False)  # most frequent first
                   .keys().zipWithIndex().collectAsMap())
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
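
With the hypothetical sample from the sketch above, "red" appears twice and would therefore receive index 0, matching StringIndexer's frequency ordering:

# Frequency-ordered indexing on the hypothetical sample rdd from above
dic = (rdd.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b)
          .sortBy(lambda kv: kv[1], ascending=False)
          .keys().zipWithIndex().collectAsMap())
print(dic)  # {'red': 0, 'blue': 1, 'green': 2} (ties may come out in either order)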
