
Pyspark DataFrame - How to convert one column from categorical values to int?

I have a PySpark DataFrame and I want to convert one of its columns from string to int. Example:

Table 1:

+------------+-----+
|categories  |value|
+------------+-----+
|         red| 0.23|
|       green| 0.34|
|      yellow| 0.56|
|       black| 0.11|
|         red| 0.67|
|         red| 0.34|
|       green| 0.45|
+------------+-----+

Table 2:

+------------+-----+
|categ_num   |value|
+------------+-----+
|           1| 0.23|
|           2| 0.34|
|           3| 0.56|
|           4| 0.11|
|           1| 0.67|
|           1| 0.34|
|           2| 0.45|
+------------+-----+

So, in this case: [red=1, green=2, yellow=3, black=4].

But I don't know all the colors in advance, so I can't assign the numbers manually. I need a way to do the assignment automatically.

Could anyone help me, please?

This code works for me:

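from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

# Reuse the active session (or create one) so the snippet runs on its own
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Fit the indexer on the string column, then add the numeric index as a new column
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()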

https://spark.apache.org/docs/latest/ml-features.html#stringindexer
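To make it clearer which numbers come out, here is a plain-Python sketch (no Spark needed) of the frequency-descending ordering that StringIndexer uses by default: the most frequent label gets index 0, and the output values are doubles. The helper name `frequency_desc_index` is made up for this illustration, and the alphabetical tie-break is an assumption about how equal counts are resolved.

```python
from collections import Counter


def frequency_desc_index(labels):
    """Map each label to a float index; the most frequent label gets 0.0.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Same category values as in the DataFrame above
categories = ["a", "b", "c", "a", "a", "c"]
mapping = frequency_desc_index(categories)

# Rebuild (id, category, categoryIndex) rows like the indexed DataFrame
indexed_rows = [(i, c, mapping[c]) for i, c in enumerate(categories)]
for row in indexed_rows:
    print(row)
```

Here "a" (3 occurrences) maps to 0.0, "c" (2) to 1.0 and "b" (1) to 2.0. Since the question asks for integers starting at 1, the Spark column would probably still need something like `.cast("int") + 1` afterwards.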

If you want a solution with less code and your categories do not need to be ordered in a special way, you can use dense_rank from pyspark.sql.functions.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Rank the category values in sorted order (1-based) and keep the result
df = df.withColumn("categ_num", F.dense_rank().over(Window.orderBy("categories")))

Keep in mind that window functions can increase runtime: a window without partitioning, as used here, moves all rows to a single partition.
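As a rough illustration of the numbering this gives on the data from the question, here is a plain-Python sketch of a dense rank over the sorted category values (no Spark required). The helper `dense_rank` below is a stand-in written for this example, not the PySpark function itself.

```python
def dense_rank(values):
    """Return a 1-based dense rank for each value, ranking in ascending order."""
    rank_of = {v: i for i, v in enumerate(sorted(set(values)), start=1)}
    return [rank_of[v] for v in values]


# The "categories" column from Table 1 in the question
categories = ["red", "green", "yellow", "black", "red", "red", "green"]
categ_num = dense_rank(categories)

for category, number in zip(categories, categ_num):
    print(category, number)
```

Note that the numbers follow alphabetical order (black=1, green=2, red=3, yellow=4), not the red=1, green=2 order shown in Table 2. That is fine as long as any consistent mapping is acceptable.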

There is a StringIndexer in Spark ML.

