How to concat the same column value to a new column with comma delimiters in Spark
The format of the input data is as follows:
+--------------------+-------------+--------------------+
| date | user | product |
+--------------------+-------------+--------------------+
| 2016-10-01 | Tom | computer |
+--------------------+-------------+--------------------+
| 2016-10-01 | Tom | iphone |
+--------------------+-------------+--------------------+
| 2016-10-01 | Jhon | book |
+--------------------+-------------+--------------------+
| 2016-10-02 | Tom | pen |
+--------------------+-------------+--------------------+
| 2016-10-02 | Jhon | milk |
+--------------------+-------------+--------------------+
And the format of the output is as follows:
+-----------+-----------------------+
| user | products |
+-----------+-----------------------+
| Tom | computer,iphone,pen |
+-----------+-----------------------+
| Jhon | book,milk |
+-----------+-----------------------+
The output shows all the products each user bought, ordered by date.
I want to process this data using Spark. Can anyone help me, please? Thank you.
It is better to use a map plus reduceByKey() combination rather than groupBy. Also assuming the data doesn't have the
// To read the data from a file instead, use: val ordersRDD = sc.textFile("/file/path")
val ordersRDD = sc.parallelize(List(
  ("2016-10-01", "Tom", "computer"),
  ("2016-10-01", "Tom", "iphone"),
  ("2016-10-01", "Jhon", "book"),
  ("2016-10-02", "Tom", "pen"),
  ("2016-10-02", "Jhon", "milk")))
// Key by (user, date), sort by key, re-key by user, then concatenate the products
val dtusrGrpRDD = ordersRDD.map(rec => ((rec._2, rec._1), rec._3))
  .sortByKey().map(x => (x._1._1, x._2))
  .reduceByKey((acc, v) => acc + "," + v)
// If needed, convert it to a DataFrame
scala> dtusrGrpRDD.toDF("user", "product").show()
+----+-------------------+
|user| product|
+----+-------------------+
| Tom|computer,iphone,pen|
|Jhon| book,milk|
+----+-------------------+
If you are using a HiveContext (which you should be):
Example using Python:
from pyspark.sql.functions import collect_set
df = ...  # load your DataFrame here
new_df = df.groupBy("user").agg(collect_set("product").alias("products"))
If you don't want the resulting list in products to be deduplicated, you can use collect_list instead.
For DataFrames it is a two-liner:
import org.apache.spark.sql.functions.collect_list
// use collect_set instead of collect_list if you don't want duplicates
val output = join.groupBy("user").agg(collect_list($"product"))
groupBy gives you the rows grouped by user, after which you can apply collect_list or collect_set to the grouped dataset.