Pyspark + association rule mining: how to transform a dataframe into a format suitable for frequent pattern mining?

Question

I am trying to use pyspark to do association rule mining. Let's say my data looks like this:

myItems=spark.createDataFrame([(1,'a'),
                               (1,'b'),
                               (1,'d'),
                               (1,'c'),
                               (2,'a'),
                               (2,'c'),],
                              ['id','item'])

But according to https://spark.apache.org/docs/2.2.0/ml-frequent-pattern-mining.html , the format should be:

df = spark.createDataFrame([(1, ['a', 'b', 'd','c']),
                            (2, ['a', 'c'])], 
                           ["id", "items"])

So I need to transform my data from a vertical layout (one row per item) to a horizontal one (one row per id with a list of items), and the number of items differs from id to id.

How can I do this transformation, or is there another way to do it?

Answer 1

Assuming your original definition of myItems, collect_list does what you need once you group the dataframe by id:

>>> myItems=spark.createDataFrame([(1,'a'),
...                                (1,'b'),
...                                (1,'d'),
...                                (1,'c'),
...                                (2,'a'),
...                                (2,'c'),],
...                               ['id','item'])
>>> from pyspark.sql.functions import collect_list
>>> myItems.groupBy(myItems.id).agg(collect_list('item')).show()
+---+------------------+
| id|collect_list(item)|
+---+------------------+
|  1|      [a, b, d, c]|
|  2|            [a, c]|
+---+------------------+

Answer 1 (3, accepted, 2019-04-08 07:54:43)
