PySpark create external dictionary from a dataframe column of strings
I have a DataFrame as follows:
data = [
("100", 'the boy wants go to school'),
("200", 'he is a good boy'),
("300", 'he likes to play football in the school')
]
schema = ['id', 'description']
df = spark.createDataFrame(data, schema=schema)
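I want to build an external dictionary from the words in each row of the "description" column (that is, not a new column; I need to access the dictionary separately later).
The expected output, i.e. my dictionary, should be: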
the: 2
boy: 2
wants: 1
he: 2
school: 2
play: 1
...
I know how to do this with pandas. How can I do it with PySpark?
(I tried MapType, udf, etc., but without success.)
Thanks in advance!
The following should do it:
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
("100", "the boy wants go to school"),
("200", "he is a good boy"),
("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)
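# Split each description on spaces, emit one row per word, then count each word.
df = (
    df.withColumn("word", f.explode(f.split(f.col("description"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)
# Collect the (word, count) rows to the driver and build a plain Python dict.
res = {row["word"]: row["count"] for row in df.collect()}
print(res)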
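For a dataset this small, the Spark result can be cross-checked in plain Python. The sketch below is only a sanity check and not part of the Spark solution. It uses `collections.Counter` and splits on a single space to mirror `f.split(..., " ")`. The names `descriptions` and `expected` are illustrative.

```python
from collections import Counter

# Same three descriptions as in the question's DataFrame.
descriptions = [
    "the boy wants go to school",
    "he is a good boy",
    "he likes to play football in the school",
]

# Split on a single space to mirror f.split(col, " ") in the Spark version.
expected = Counter(word for text in descriptions for word in text.split(" "))
print(dict(expected))
```

If `res` from the Spark code and `dict(expected)` compare equal, the aggregation behaves as intended on this sample.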
Another way to export the dictionary is to convert your Spark DataFrame to pandas, as shown here.
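A minimal sketch of the pandas route. It assumes the grouped DataFrame has the columns `word` and `count` as in the code above, and that pandas is installed on the driver. The helper name `word_counts_via_pandas` is hypothetical. Like `collect()`, `toPandas()` pulls every row to the driver, so this only suits results that fit in memory.

```python
def word_counts_via_pandas(grouped_df):
    """Build a {word: count} dict from a grouped Spark DataFrame.

    Assumes grouped_df has the columns "word" and "count".
    """
    # toPandas() materializes the whole Spark DataFrame on the driver.
    pdf = grouped_df.toPandas()
    # Pair each word with its count and build a plain Python dict.
    return dict(zip(pdf["word"], pdf["count"]))
```

With the grouped `df` from the answer above, `res = word_counts_via_pandas(df)` should give the same dictionary as the `collect()` version.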