Converting a PySpark RDD with a typed list to a DataFrame

Question

I have an RDD in the following format:

 [(1, 
 (Rating(user=1, product=3, rating=0.99), 
  Rating(user=1, product=4, rating=0.91),  
  Rating(user=1, product=9, rating=0.68))),   
  (2, 
 (Rating(user=2, product=11, rating=1.01), 
  Rating(user=2, product=12, rating=0.98), 
  Rating(user=2, product=45, rating=0.97))), 
  (3, 
 (Rating(user=3, product=23, rating=1.01), 
  Rating(user=3, product=34, rating=0.99), 
  Rating(user=3, product=45, rating=0.98)))]

I have been unable to find any example of using map, lambda, etc. to work with this kind of named data. Ideally, I would like the output to be a DataFrame in the following format:

User    Ratings
1       3,0.99|4,0.91|9,0.68
2       11,1.01|12,0.98|45,0.97
3       23,1.01|34,0.99|45,0.98

Any pointers would be appreciated. Note that the number of ratings per user is variable, not always 3.

Answer 1

With the RDD defined as

from pyspark.mllib.recommendation import Rating

rdd = sc.parallelize([
    (1,
        (Rating(user=1, product=3, rating=0.99), 
        Rating(user=1, product=4, rating=0.91),  
        Rating(user=1, product=9, rating=0.68))),   
    (2, 
        (Rating(user=2, product=11, rating=1.01), 
        Rating(user=2, product=12, rating=0.98), 
        Rating(user=2, product=45, rating=0.97))), 
    (3, 
        (Rating(user=3, product=23, rating=1.01), 
        Rating(user=3, product=34, rating=0.99), 
        Rating(user=3, product=45, rating=0.98)))])

you can call mapValues with list:

df = rdd.mapValues(list).toDF(["User", "Ratings"])

df.printSchema()
# root
#  |-- User: long (nullable = true)
#  |-- Ratings: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- user: long (nullable = true)
#  |    |    |-- product: long (nullable = true)
#  |    |    |-- rating: double (nullable = true)

or provide a schema:

df = spark.createDataFrame(rdd, "struct<User:long,ratings:array<struct<user:long,product:long,rating:double>>>")


df.printSchema()
# root
#  |-- User: long (nullable = true)
#  |-- ratings: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- user: long (nullable = true)
#  |    |    |-- product: long (nullable = true)
#  |    |    |-- rating: double (nullable = true)
# 

df.show()
# +----+--------------------+
# |User|             ratings|
# +----+--------------------+
# |   1|[[1,3,0.99], [1,4...|
# |   2|[[2,11,1.01], [2,...|
# |   3|[[3,23,1.01], [3,...|
# +----+--------------------+
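
For intuition about the mapValues(list) step, here is a plain-Python sketch that runs without Spark. The `map_values` helper is hypothetical and only imitates what RDD.mapValues does to each (key, value) pair, and the namedtuple is a stand-in for pyspark's Rating (assumed to have the same field order). As far as I understand, schema inference maps a Python list to an array and a tuple to a struct, which is why the values are converted to lists first.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (assumption: same field order).
Rating = namedtuple("Rating", ["user", "product", "rating"])

records = [
    (1, (Rating(1, 3, 0.99), Rating(1, 4, 0.91), Rating(1, 9, 0.68))),
    (2, (Rating(2, 11, 1.01), Rating(2, 12, 0.98))),  # variable length is fine
]

def map_values(pairs, f):
    """Hypothetical local imitation of RDD.mapValues: transform values, keep keys."""
    return [(key, f(value)) for key, value in pairs]

as_lists = map_values(records, list)
print(type(as_lists[0][1]).__name__)  # list
```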

If you want to drop the user field:

df_without_user = spark.createDataFrame(
    rdd.mapValues(lambda xs: [x[1:] for x in xs]),
    "struct<User:long,ratings:array<struct<product:long,rating:double>>>"
)
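
To see what the lambda passed to mapValues does, here is a small sketch in plain Python. It assumes a namedtuple stand-in for Rating so that it runs without Spark. Slicing with `x[1:]` drops the first field (user) and leaves a plain (product, rating) tuple, which matches the two-field struct in the schema string.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (assumption: same field order).
Rating = namedtuple("Rating", ["user", "product", "rating"])

ratings = (Rating(1, 3, 0.99), Rating(1, 4, 0.91), Rating(1, 9, 0.68))

# Same lambda as in the mapValues call, applied to one user's ratings.
drop_user = lambda xs: [x[1:] for x in xs]

trimmed = drop_user(ratings)
print(trimmed)  # [(3, 0.99), (4, 0.91), (9, 0.68)]
```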

If you want to format the column as a single string, you can use a udf:

from pyspark.sql.functions import udf

@udf
def format_ratings(ratings):
    return "|".join(",".join(str(_) for _ in r[1:]) for r in ratings)


df.withColumn("ratings", format_ratings("ratings")).show(3, False)

# +----+-----------------------+
# |User|ratings                |
# +----+-----------------------+
# |1   |3,0.99|4,0.91|9,0.68   |
# |2   |11,1.01|12,0.98|45,0.97|
# |3   |23,1.01|34,0.99|45,0.98|
# +----+-----------------------+

How the "magic" works:

Iterate over the array of ratings:
```
 (... for r in ratings) 
```
For each rating, drop the first field and convert the remaining fields to str:
```
 (str(_) for _ in r[1:]) 
```
Concatenate the fields of each rating with a "," separator:
```
 ",".join(str(_) for _ in r[1:]) 
```
Concatenate all rating strings with "|":
```
 "|".join(",".join(str(_) for _ in r[1:]) for r in ratings) 
```
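
The same steps can be traced outside Spark, one intermediate result at a time. This is only a sketch. The namedtuple is an assumed stand-in for Rating, and inside the udf each element would be a Row, which can be sliced the same way.

```python
from collections import namedtuple

# Stand-in for pyspark.mllib.recommendation.Rating (assumption: same field order).
Rating = namedtuple("Rating", ["user", "product", "rating"])
ratings = [Rating(1, 3, 0.99), Rating(1, 4, 0.91), Rating(1, 9, 0.68)]

# Drop the first field of each rating and convert the rest to str.
fields = [[str(v) for v in r[1:]] for r in ratings]
# Join the fields of each rating with ",".
per_rating = [",".join(f) for f in fields]
# Join all rating strings with "|".
result = "|".join(per_rating)
print(result)  # 3,0.99|4,0.91|9,0.68
```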

Alternative implementation:

@udf
def format_ratings(ratings):
    return "|".join("{},{}".format(r.product, r.rating) for r in ratings)

Answer 1 (accepted, 2018-01-22 13:00:47)
