Pyspark DataFrame - How to convert one column from categorical values to int?
I have a PySpark DataFrame and I want to convert one of its columns from string to int. Example:
Table 1:
+------------+-----+
|categories |value|
+------------+-----+
| red| 0.23|
| green| 0.34|
| yellow| 0.56|
| black| 0.11|
| red| 0.67|
| red| 0.34|
| green| 0.45|
+------------+-----+
Table 2:
+------------+-----+
|categ_num |value|
+------------+-----+
| 1| 0.23|
| 2| 0.34|
| 3| 0.56|
| 4| 0.11|
| 1| 0.67|
| 1| 0.34|
| 2| 0.45|
+------------+-----+
So, in that case: [red=1, green=2, yellow=3 and black=4].
But I don't know all the colors in advance, so I can't assign the numbers manually. I need a way to do the assignment automatically.
Could anyone help me, please?
This code works for me:
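from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()  # reuse the active session or create one
df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
# Fit the indexer on the string column, then add the numeric index as categoryIndex
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()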
https://spark.apache.org/docs/latest/ml-features.html#stringindexer
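Note that the numbers you get will probably not match Table 2 exactly. By default StringIndexer orders labels by descending frequency and starts at 0, and the output column is a double. The plain-Python sketch below (no Spark needed) mimics that ordering on the colors from the question so you can see the resulting mapping. The helper name index_labels is made up for this illustration, and the alphabetical tie-break is an assumption based on the Spark 3.x documentation, so check it against your version.

```python
from collections import Counter

def index_labels(values):
    """Map each distinct label to an int: most frequent first, ties alphabetical."""
    counts = Counter(values)
    # Sort by descending count, then by label (assumed tie-break)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

categories = ["red", "green", "yellow", "black", "red", "red", "green"]
mapping = index_labels(categories)

# Shift to 1-based integers, as in Table 2
categ_num = [mapping[c] + 1 for c in categories]

print(mapping)
print(categ_num)
```

With the real StringIndexer output you would get the same kind of integer column by casting categoryIndex to int and adding 1.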
If you want a solution with less code and your categories do not need to be ordered in a special way, you can use dense_rank from pyspark.sql.functions:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
df.withColumn("categ_num", F.dense_rank().over(Window.orderBy("categories")))
Keep in mind that a window with orderBy but no partitionBy moves all rows into a single partition, so this can be slow on large DataFrames.
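To make clear which number each color ends up with, here is a small plain-Python sketch of what a dense rank over the sorted categories computes. It is only an illustration of the ranking logic, not Spark code, and the function below is a hand-written stand-in rather than the PySpark one.

```python
def dense_rank(values):
    """Return the 1-based dense rank of each value, ranking by sorted order."""
    # Distinct values in ascending order get consecutive ranks with no gaps
    rank_of = {v: r for r, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]

categories = ["red", "green", "yellow", "black", "red", "red", "green"]
ranks = dense_rank(categories)

print(list(zip(categories, ranks)))
```

So the numbers follow alphabetical order (black=1, green=2, red=3, yellow=4) rather than the order in which the colors first appear.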
Spark ML has a StringIndexer.