Create a DataFrame from a dictionary by using an RDD in PySpark
I have a dictionary named Word_Counts; each key is a word and each value is the number of times that word occurs in the text. My aim is to convert it to a DataFrame with two columns, word and count.
items = list(Word_Counts.items())[:5]
items
output:
[('Akdeniz’in', 14), ('en', 13287), ('büyük', 3168), ('deniz', 1276), ('festivali:', 6)]
When I used sc.parallelize to create an RDD, I realized that it drops all the values and keeps only the keys, so the table I create from it contains only the keys. Please let me know how I can create a DataFrame from a dictionary by using an RDD.
rdd1 = sc.parallelize(Word_Counts)
Df_Hur = spark.read.json(rdd1)
rdd1.take(5)
output:
['Akdeniz’in', 'en', 'büyük', 'deniz', 'festivali:']
Df_Hur.show(5)
output:
+---------------+
|_corrupt_record|
+---------------+
|     Akdeniz’in|
|             en|
|          büyük|
|          deniz|
|     festivali:|
+---------------+
My aim is:
word count
Akdeniz’in 14
en 13287
büyük 3168
deniz 1276
festivali: 6
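The two outputs in the question can be reproduced without Spark. Iterating over a Python dictionary yields only its keys, so sc.parallelize(Word_Counts) distributes the words alone; spark.read.json then tries to parse each word as a JSON document, and bare words are not valid JSON, which is consistent with everything landing in _corrupt_record. The sketch below uses a small sample dictionary, and is_valid_json is a hypothetical helper introduced only for illustration.

```python
import json

# Small sample standing in for the Word_Counts dictionary from the question.
word_counts = {"Akdeniz’in": 14, "en": 13287, "büyük": 3168}

# Iterating a dict (which is what happens when it is passed where a
# collection is expected) yields keys only.
keys_only = list(word_counts)

# .items() yields (key, value) tuples, which keep the counts.
pairs = list(word_counts.items())


def is_valid_json(text):
    """Return True if `text` parses as a JSON document (illustrative helper)."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False


# A bare word is not a JSON document, so none of the keys parse.
valid_flags = [is_valid_json(key) for key in keys_only]

print(keys_only)
print(pairs)
print(valid_flags)
```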
You can feed word_count.items() (word_count being your Word_Counts dictionary) directly to parallelize:
df_hur = sc.parallelize(word_count.items()).toDF(['word', 'count'])
df_hur.show()
>>>
+----------+-----+
|      word|count|
+----------+-----+
|Akdeniz’in|   14|
|        en|13287|
|     büyük| 3168|
|     deniz| 1276|
|festivali:|    6|
+----------+-----+
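As a slightly more defensive variant of the same idea, the dictionary can first be turned into a plain list of (word, count) tuples and then handed to Spark. This is only a sketch: dict_to_rows, rows_to_df_via_rdd and rows_to_df_direct are hypothetical helper names, and spark is assumed to be an already running SparkSession like the one used in the question. The second helper uses spark.createDataFrame, which should give the same two columns without the explicit RDD step.

```python
def dict_to_rows(counts):
    """Turn a {word: count} dictionary into a list of (word, count) tuples."""
    # .items() keeps the values; building a list means the data can be
    # iterated more than once and inspected before it goes to Spark.
    return [(str(word), int(count)) for word, count in counts.items()]


def rows_to_df_via_rdd(spark, rows):
    """Build the DataFrame through an RDD, as the question asks.

    `spark` is assumed to be an existing SparkSession.
    """
    rdd = spark.sparkContext.parallelize(rows)
    return rdd.toDF(["word", "count"])


def rows_to_df_direct(spark, rows):
    """Build the same DataFrame without creating the RDD explicitly."""
    return spark.createDataFrame(rows, ["word", "count"])


# Sample data only; in the question this would be Word_Counts.
sample = {"en": 13287, "büyük": 3168, "deniz": 1276}
rows = dict_to_rows(sample)
print(rows)
```

With a live session, rows_to_df_via_rdd(spark, rows).show() should print a word/count table in the same shape as the one above.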