PySpark: creating an external dictionary from a string column of a DataFrame

Question

I have a DataFrame like this:

data = [
    ("100", 'the boy wants go to school'),
    ("200", 'he is a good boy'),
    ("300", 'he likes to play football in the school')
]
schema = ['id', 'description']
df = spark.createDataFrame(data, schema=schema)

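I want to build an external dictionary from the words in every row of the "description" column (that is, not a new column; I need to access the dictionary separately later).

The expected output, i.e. my dictionary, should be: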

the: 2
boy: 2
wants: 1
he: 2
school: 2
play: 1
...

I know how to do this with pandas. How can I do it with PySpark?

(I tried MapType, udf, and so on, but without success.)

Thanks in advance!

Answer 1

The following should do it:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)

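# Split each description into words, emit one row per word, then count occurrences.
df = (
    df.withColumn("word", f.explode(f.split(f.col("description"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

# Bring the aggregated rows back to the driver and build a plain Python dict.
res = df.rdd.map(lambda row: row.asDict()).collect()
res = {d["word"]: d["count"] for d in res}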

print(res)
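
For a small sample like the one in the question, the Spark result can be cross-checked on the driver with plain Python. The sketch below is not part of the original answer: it uses collections.Counter with the same single-space split, and the names rows, word_counts and res_check are illustrative. It only suits data that fits in memory.

```python
from collections import Counter

# Same sample rows as in the question: (id, description).
rows = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]

# Split every description on a single space (as the Spark code does)
# and count how often each word occurs across all rows.
word_counts = Counter(
    word for _id, description in rows for word in description.split(" ")
)

# A Counter is a dict subclass; convert it if a plain dict is needed.
res_check = dict(word_counts)
print(res_check)
```

Comparing res_check with res from the Spark job is a quick way to confirm the pipeline behaves as expected before running it on larger data.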

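Another way to export the dictionary is to convert your Spark DataFrame to pandas (for example with toPandas()) and build the dictionary from there.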
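
A minimal sketch of that pandas route, assuming df is the grouped DataFrame from the code above with "word" and "count" columns. The helper name counts_via_pandas is hypothetical; toPandas() is the standard PySpark method, it needs pandas installed, and it collects every row to the driver, so it only suits results that fit in memory.

```python
def counts_via_pandas(grouped_df):
    """Return {word: count} from a grouped Spark DataFrame.

    Assumes grouped_df has the columns "word" and "count".
    """
    # toPandas() collects the whole result to the driver as a pandas DataFrame.
    pdf = grouped_df.toPandas()
    # Pair up the two columns; int() turns NumPy integers into plain Python ints.
    return {word: int(count) for word, count in zip(pdf["word"], pdf["count"])}
```

With the grouped df from the answer, res = counts_via_pandas(df) should give the same dictionary as the collect() version.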

Answer 1: accepted 2022-12-20 13:15:16