PySpark DataFrame - How to convert a column from categorical values to integers?

Question

I have a PySpark DataFrame and I want to convert one of its columns from string to int. Example:

Table 1:

+------------+-----+
|categories  |value|
+------------+-----+
|         red| 0.23|
|       green| 0.34|
|      yellow| 0.56|
|       black| 0.11|
|         red| 0.67|
|         red| 0.34|
|       green| 0.45|
+------------+-----+

Table 2:

+------------+-----+
|categ_num   |value|
+------------+-----+
|           1| 0.23|
|           2| 0.34|
|           3| 0.56|
|           4| 0.11|
|           1| 0.67|
|           1| 0.34|
|           2| 0.45|
+------------+-----+

So, in this case: [red=1, green=2, yellow=3, black=4].

But I don't know all the colors in advance, so I can't assign the numbers manually. I need a way to do the assignment automatically.

Could anyone help me, please?

Answer 1

This code works for me:

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
[(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()

https://spark.apache.org/docs/latest/ml-features.html#stringindexer
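
Note that StringIndexer writes a double column with 0-based indices, and by default the most frequent label gets index 0. The following is a plain-Python sketch of that assignment applied to the colors from the question; it is not Spark's implementation. The helper name `frequency_desc_index` is hypothetical, and the alphabetical tie-break is an assumption based on what the Spark 3.x documentation describes for the default ordering.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each distinct label to a 0-based index, most frequent first.

    Hypothetical helper for illustration only. Ties are broken
    alphabetically, which is an assumption taken from the Spark 3.x docs.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


# The "categories" column from Table 1 of the question.
categories = ["red", "green", "yellow", "black", "red", "red", "green"]
mapping = frequency_desc_index(categories)

# Shift to 1-based integers, as in Table 2 of the question.
categ_num = [mapping[c] + 1 for c in categories]

print(mapping)
print(categ_num)
```

In PySpark itself, the same shift can probably be done on the indexed DataFrame with something like `indexed.withColumn("categ_num", F.col("categoryIndex").cast("int") + 1)`, after `import pyspark.sql.functions as F`. The numbers follow frequency, so they will not necessarily match the order shown in Table 2.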

Answer 2

If you want a solution with less code and your categories do not need to be ordered in any particular way, you can use dense_rank from pyspark.sql.functions.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

df.withColumn("categ_num", F.dense_rank().over(Window.orderBy("categories")))

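Keep in mind that window functions can increase runtime. In particular, a window with orderBy but no partitionBy makes Spark move all rows into a single partition, which can be slow on large DataFrames.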
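
To make the resulting numbering concrete, here is a plain-Python sketch of what a dense rank over the sorted category values yields for the colors in the question. It only mimics the ranking logic and is not how Spark evaluates the window; the helper name `dense_rank_by_value` is hypothetical.

```python
def dense_rank_by_value(values):
    """Return the 1-based dense rank of each value under ascending sort order.

    Hypothetical helper for illustration: equal values share a rank and
    ranks have no gaps, which is the defining property of a dense rank.
    """
    ordered = sorted(set(values))
    rank = {value: position for position, value in enumerate(ordered, start=1)}
    return [rank[v] for v in values]


# The "categories" column from Table 1 of the question.
categories = ["red", "green", "yellow", "black", "red", "red", "green"]
categ_num = dense_rank_by_value(categories)

print(categ_num)
```

With this approach the numbers follow alphabetical order (black=1, green=2, red=3, yellow=4) rather than the order in Table 2, which is fine when the specific assignment does not matter.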

Answer 3

There is a StringIndexer in Spark ML.

Answer 1: score 2, accepted, 2017-08-04 17:02:02
Answer 2: score 1, 2021-09-28 08:02:36
Answer 3: score 0, 2017-08-04 13:58:45
