
PySpark create external dictionary from a dataframe column of strings

I have a dataframe as follows:

data = [
    ("100", 'the boy wants go to school'),
    ("200", 'he is a good boy'),
    ("300", 'he likes to play football in the school')
]
schema = ['id', 'description']
df = spark.createDataFrame(data, schema=schema)

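I want to create an external dictionary from the words in each row of the "description" column (i.e. not a new column; I need to access the dictionary separately later).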

The desired output, i.e. my dictionary, should be:

the: 2
boy: 2
wants: 1
he: 2
school: 2
play: 1
...

I know how to do this with pandas. How can I do it with PySpark?

(I tried MapType, udf, etc., but without success.)

Thanks in advance!

The following should do it:

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [
    ("100", "the boy wants go to school"),
    ("200", "he is a good boy"),
    ("300", "he likes to play football in the school"),
]
schema = ["id", "description"]
df = spark.createDataFrame(data, schema=schema)

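# Split each description into words (one row per word), then count occurrences
df = (
    df.withColumn("word", f.explode(f.split(f.col("description"), " ")))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

# Collect the (word, count) rows to the driver and build a plain Python dict
res = {row["word"]: row["count"] for row in df.collect()}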

print(res)
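
As a quick cross-check on the counts, the same tally can be sketched in plain Python with `collections.Counter`. This is only an illustration for data small enough to sit on the driver; the `descriptions` list and the `expected` name are assumptions introduced here, not part of the answer above. Note that `str.split()` splits on any run of whitespace, while the Spark code splits on a single space; for this sample the two agree.

```python
from collections import Counter

# Same three descriptions as in the question (hypothetical local copy)
descriptions = [
    "the boy wants go to school",
    "he is a good boy",
    "he likes to play football in the school",
]

# Split each description on whitespace and tally every word
expected = Counter(word for text in descriptions for word in text.split())

# Convert to a plain dict for comparison with the Spark result
print(dict(expected))
```

Comparing `dict(expected)` with `res` from the Spark job is a cheap way to confirm that the distributed count matches on a small sample.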

Another way to export the dictionary is to convert your Spark DataFrame to pandas, as shown here

