
StringIndexer in Spark MLlib

I have a column of nominal values in my PipelinedRDD that I would like to convert to index encoding for classification purposes.

I used to use the StringIndexer in pyspark.ml, which was extremely easy to use. However, this time I am learning how to work with an RDD instead of a DataFrame, and there is no equivalent in pyspark.mllib.

Any help is appreciated.

There is no StringIndexer in Spark MLlib, so you need to do the work yourself. Start by collecting all distinct values of that column and assigning each one a number, then save the result as a dictionary. Afterwards, apply the dictionary to the original RDD values.

The code below assumes that PipelinedRDD contains two values for each row, with the value to convert in the first position (0):

dic = PipelinedRDD.map(lambda x: x[0]).distinct().zipWithIndex().collectAsMap()
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
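To make the two steps concrete, here is a plain-Python sketch in which an ordinary list stands in for the RDD. The sample rows are made up for illustration. Sorting the distinct values before numbering them is my own addition so that the result is reproducible; as far as I know, the order produced by distinct().zipWithIndex() on a real RDD depends on how the data is partitioned.

```python
# Made-up sample rows: (nominal value, label). A list stands in for the RDD.
rows = [("red", 1.0), ("blue", 0.0), ("red", 0.0), ("green", 1.0)]

# Step 1: collect the distinct values and give each one an index.
# Sorting is an assumption added here to get a reproducible mapping.
distinct_values = sorted({row[0] for row in rows})
dic = {value: idx for idx, value in enumerate(distinct_values)}

# Step 2: replace the nominal value in position 0 with its index.
indexed = [(dic[row[0]], row[1]) for row in rows]

print(dic)      # {'blue': 0, 'green': 1, 'red': 2}
print(indexed)  # [(2, 1.0), (0, 0.0), (2, 0.0), (1, 1.0)]
```

On the real RDD the same idea would presumably be to insert a sort between distinct() and zipWithIndex(), but I have not checked that against a cluster.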

Note: This differs slightly from the Spark implementation of StringIndexer, since it does not take the frequency of the values into account (by default Spark assigns 0 to the most frequent value, 1 to the next most frequent, and so on). However, in most cases it does not matter which index each string is assigned.


Extension: If you want to mimic exactly what StringIndexer does, as mentioned in the note above, the code can be slightly modified to take the frequencies into account:

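# Count each value, sort by descending count, then index so the most frequent value gets 0.
dic = PipelinedRDD.map(lambda x: (x[0], 1)).reduceByKey(lambda a, b: a + b).sortBy(lambda kv: -kv[1]).keys().zipWithIndex().collectAsMap()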
PipelinedRDD = PipelinedRDD.map(lambda x: (dic[x[0]], x[1]))
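An alternative sketch, assuming the number of distinct values is small enough to fit on the driver (collectAsMap already assumes that): count the values first and build the dictionary in plain Python. The helper name frequency_index is hypothetical, not part of Spark. Ties are broken by the value itself, which I believe matches recent Spark versions, but treat that as an assumption. The Counter below only stands in for the result of countByValue() so the example runs without a cluster.

```python
from collections import Counter


def frequency_index(counts):
    """Map each value to an index: most frequent first, ties broken by value.

    `counts` is any mapping of value -> number of occurrences. The values
    must be mutually comparable (e.g. all strings) for the tie-break to work.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {value: idx for idx, (value, _) in enumerate(ordered)}


# With Spark, the counts would come from something like:
#   counts = PipelinedRDD.map(lambda x: x[0]).countByValue()
# Here a Counter over made-up values stands in for that result.
counts = Counter(["b", "a", "b", "c", "a", "b"])
dic = frequency_index(counts)
print(dic)  # {'b': 0, 'a': 1, 'c': 2}
```

The resulting dic can then be applied with the same map call as above.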
