I have a column of nominal values in my PipelinedRDD that I would like to convert to index encoding for classification purposes. I used to use the StringIndexer in pyspark.ml, which was extremely easy to use. However, this time I am learning how to work with an RDD instead of a DataFrame, and there is no such thing in pyspark.mllib.
Any help is appreciated.
There is no StringIndexer in Spark's RDD-based MLlib API, so you need to do the work yourself. Start by collecting all possible values of that column and assigning each a number, saving the result as a dictionary. Then apply that dictionary to the values of the original RDD.
The code below assumes that PipelinedRDD contains two values per row, with the value to convert in the first position (x[0]):
# Build a value -> index mapping, then apply it to every row.
dic = PipelinedRDD.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
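As a sanity check of what those two lines produce, here is the same logic emulated in plain Python (no Spark session needed); the sample rows are made up for illustration, and note that Spark's distinct() gives no ordering guarantee, so the actual indices may come out in a different order:

```python
# Plain-Python emulation of distinct().zipWithIndex().collectAsMap()
# followed by the mapping step; the sample rows are hypothetical.
rows = [("cat", 1.0), ("dog", 2.0), ("cat", 3.0), ("fish", 4.0)]

# distinct() keeps one copy of each category; zipWithIndex() pairs each
# with a sequential index; collectAsMap() turns the pairs into a dict.
distinct_vals = list(dict.fromkeys(x[0] for x in rows))  # order-preserving distinct
dic = {val: idx for idx, val in enumerate(distinct_vals)}

# The final map() step replaces each string with its index.
encoded = [(dic[x[0]], x[1]) for x in rows]
print(dic)      # {'cat': 0, 'dog': 1, 'fish': 2}
print(encoded)  # [(0, 1.0), (1, 2.0), (0, 3.0), (2, 4.0)]
```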
Note: This differs slightly from Spark's StringIndexer implementation, since it does not take the frequency of the values into account (Spark assigns 0 to the most frequent value, 1 to the next, and so on). In most cases, however, which index a given string is assigned is of no concern.
Extension: If you want to mimic exactly what the StringIndexer does, as mentioned in the note above, the code can be modified to order the values by descending frequency. Note that groupBy/count/sort with col() belong to the DataFrame API; on an RDD you count with reduceByKey and order with sortBy:
dic = (PipelinedRDD
       .map(lambda x: (x[0], 1))
       .reduceByKey(lambda a, b: a + b)          # count occurrences of each value
       .sortBy(lambda kv: kv[1], ascending=False)  # most frequent first
       .map(lambda kv: kv[0])
       .zipWithIndex()
       .collectAsMap())
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
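The frequency-ordered mapping can likewise be checked in plain Python with collections.Counter, whose most_common() sorts by descending count just like the sortBy step above; the sample rows are again hypothetical:

```python
from collections import Counter

# Hypothetical sample rows; the categorical value is in the first position.
rows = [("dog", 1.0), ("cat", 2.0), ("cat", 3.0), ("cat", 4.0), ("dog", 5.0)]

# Count each category and order by descending frequency, mirroring
# StringIndexer: the most frequent value gets index 0.
counts = Counter(x[0] for x in rows)
ordered = [val for val, _ in counts.most_common()]
dic = {val: idx for idx, val in enumerate(ordered)}

encoded = [(dic[x[0]], x[1]) for x in rows]
print(dic)      # {'cat': 0, 'dog': 1}  -- 'cat' appears 3 times, 'dog' twice
print(encoded)  # [(1, 1.0), (0, 2.0), (0, 3.0), (0, 4.0), (1, 5.0)]
```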