![](/img/trans.png)
[英]Using Pyspark to create tuples from a line entry of list of words and count using RDD
[英]list of tuples to rdd with count using map reduce pyspark
我有一個類似以下的rdd:
[('C3', [{'Item': 'Shirt', 'Color ': 'Black', 'Size': '32','Price':'2500'}, {'Item': 'Sweater', 'Color ': 'Red', 'Size': '35', 'Price': '1000'}, {'Item': 'Jeans', 'Color ': 'Yellow', 'Size': '30', 'Price': '1500'}]), ('C1', [{'Item': 'Shirt', 'Color ': 'Green', 'Size': '25', 'Price': '2000'}, {'Item': 'Saree', 'Color ': 'Green', 'Size': '25', 'Price': '1500'}, {'Item': 'Saree', 'Color ': 'Green', 'Size': '25', 'Price': '1500'}, {'Item': 'Jeans', 'Color ': 'Yellow', 'Size': '30', 'Price': '1500'}])]
我們可以通過以下方式在rdd之上創建:
sc.parallelize([('C3', [{'Item': 'Shirt', 'Color ': 'Black', 'Size': '32','Price':'2500'}, {'Item': 'Sweater', 'Color ': 'Red', 'Size': '35', 'Price': '1000'}, {'Item': 'Jeans', 'Color ': 'Yellow', 'Size': '30', 'Price': '1500'}]), ('C1', [{'Item': 'Shirt', 'Color ': 'Green', 'Size': '25', 'Price': '2000'}, {'Item': 'Saree', 'Color ': 'Green', 'Size': '25', 'Price': '1500'}, {'Item': 'Saree', 'Color ': 'Green', 'Size': '25', 'Price': '1500'}, {'Item': 'Jeans', 'Color ': 'Yellow', 'Size': '30', 'Price': '1500'}])])
我需要創建類似於以下數據框/ rdd(我將計數添加到所有屬性)
{'C1': {'Color ': {'Green': 3, 'Yellow': 1},
'Item': {'Jeans': 1, 'Saree': 2, 'Shirt': 1},
'Price': {'1500': 3, '2000': 1},
'Size': {'25': 3, '30': 1}},
'C3': {'Color ': {'Black': 1, 'Red': 1, 'Yellow': 1},
'Item': {'Jeans': 1, 'Shirt': 1, 'Sweater': 1},
'Price': {'1000': 1, '1500': 1, '2500': 1},
'Size': {'30': 1, '32': 1, '35': 1}}}
相應的數據幀/ rdd將為:
+-------+---------------------------------------------------------------------
|custo |attr
|C1 |Map(Color -> Map(Green -> 3, yellow -> 1), Item -> Map(Jeans -> 1, Saree -> 2, Shirt ->1), Price -> |
+-------+-------------------------------------------------------------------------------------------------------+
使用udf收集計數。
from pyspark.sql import functions as f
from pyspark.sql import types as t
def count(c_dict):
res = {}
for item in c_dict:
print(type(item))
for key in item:
print(key, item[key])
if key in res:
if item[key] in res[key]:
res[key][item[key]]+= 1
else:
res[key][item[key]] = 1
else:
res[key]={}
res[key][item[key]] = 1
return(res)
schema = t.MapType(t.StringType(), t.MapType(t.StringType(), t.IntegerType()))
count_udf = f.udf(count, schema)
df2 = df.withColumn( 'col2' , count_udf(df.col2))
df.collect()
結果
[Row(col1='C3', col2={'Size': {'35': 1, '30': 1, '32': 1}, 'Price': {'2500': 1, '1500': 1, '1000': 1}, 'Item': {'Sweater': 1, 'Jeans': 1, 'Shirt': 1}, 'Color ': {'Red': 1, 'Black': 1, 'Yellow': 1}}),
Row(col1='C1', col2={'Size': {'25': 3, '30': 1}, 'Price': {'1500': 3, '2000': 1}, 'Item': {'Jeans': 1, 'Saree': 2, 'Shirt': 1}, 'Color ': {'Green': 3, 'Yellow': 1}})]
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.