Take ArrayType column from one PySpark dataframe and get corresponding value in another dataframe
I have two dataframes from running FPGrowth, one called itemsets and another called rules. They are formatted like so:
ITEMSETS DATAFRAME
+------------------------+-----+
|cart |freq |
+------------------------+-----+
|[7084781116] |10492|
|[7084781116, 2606500532]|362 |
|[7084781116, 0] |327 |
|[7084781116, 2001] |393 |
|[7084781116, 4011] |489 |
|[4460030705] |448 |
|[3800035800] |539 |
|[3022304060] |1188 |
|[2800021730] |901 |
|[1380018805] |437 |
+------------------------+-----+
RULES DATAFRAME
+--------------------+------------+
| antecedent| consequent|
+--------------------+------------+
| [1686, 4068]|[2640000010]|
|[1675, 4432, 3680...| [1673]|
|[1676, 1677, 3680...| [1678]|
|[1676, 1677, 3680...|[3680080816]|
|[1676, 1677, 3680...|[3680044476]|
|[1676, 1677, 3680...| [1675]|
|[7283040006, 7110...| [1683]|
|[7283040006, 7110...| [1682]|
|[1674, 4300000953...| [1673]|
|[1674, 4300000953...|[3680044476]|
+--------------------+------------+
I need to add a few new columns to perform a calculation (to reverse the antecedent and consequent relationship, if you're curious).
First, I need to add a new column to the rules dataframe that holds the corresponding frequencies from the itemsets dataframe. For example, where the consequent is [7084781116], the new column will hold the frequency of that array from the itemsets dataframe (10492, from the first row).
Next, I need to append the value of the consequent to the antecedent and do the same thing. For example, looking at the second row of the rules dataframe, I need to add 1673 to [1675, 4432, 3680...], then get the frequency of THAT array from the itemsets table and store it in another column.
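To make the desired result concrete, here is a plain-Python sketch of the two lookups on toy data (dictionaries standing in for the dataframes; the names and frequency values are illustrative, not from the actual data):

```python
# toy_itemsets maps a cart tuple to its frequency;
# toy_rules is a list of (antecedent, consequent) pairs.
toy_itemsets = {
    (7084781116,): 10492,
    (2606500532,): 901,
    (7084781116, 2606500532): 362,
}
toy_rules = [([7084781116], [2606500532])]

enriched = []
for antecedent, consequent in toy_rules:
    # 1) frequency of the consequent itemset
    freq = toy_itemsets.get(tuple(consequent))
    # 2) frequency of the antecedent with the consequent appended
    combined = tuple(antecedent + consequent)
    freq_enriched = toy_itemsets.get(combined)
    enriched.append((antecedent, consequent, freq, freq_enriched))

print(enriched)  # [([7084781116], [2606500532], 901, 362)]
```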
Can anyone help me out with this? I'm pretty new to PySpark and in over my head. I tried implementing several UDFs, for example something like the following, in hopes of converting the arrays to strings to make them easier to work with:
In [7]: def conv_to_str(listname):
   ...:     listname.sort()
   ...:     rv = ""
   ...:     for val in listname:
   ...:         rv += str(val) + "|"
   ...:     return rv[:-1]

In [8]: conv_to_str([1,3,2,6,5])
Out[8]: '1|2|3|5|6'
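For what it's worth, the same conversion can be written more compactly (a tidier version of the helper above, and it avoids mutating the input list in place):

```python
def conv_to_str(listname):
    # Sort a copy, stringify each value, and join with "|".
    return "|".join(str(val) for val in sorted(listname))

print(conv_to_str([1, 3, 2, 6, 5]))  # '1|2|3|5|6'
```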
Thanks!
I suggest these three steps:

1. To add the frequency column, use a left join on the consequent and cart columns.
2. To append the consequent value to the antecedent array, use the concat function (supported for arrays since Spark 2.4).
3. Perform another left join on the enriched antecedent column to get the frequency of this concatenated array.

So in PySpark the query could look like this:
from pyspark.sql.functions import col, concat

(
    rules_df
    .withColumn('antecedent_enriched', concat('antecedent', 'consequent'))
    .alias('a')
    .join(itemsets_df.alias('b'), col('a.consequent') == col('b.cart'), 'left')
    .join(itemsets_df.alias('c'), col('a.antecedent_enriched') == col('c.cart'), 'left')
    .select(
        'antecedent',
        'consequent',
        'b.freq',
        'antecedent_enriched',
        col('c.freq').alias('freq_enriched')
    )
)
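One thing to watch with this approach (an observation, not part of the original answer): joining on array equality only matches when the elements appear in the same order, so the concatenated antecedent must line up element-for-element with the cart arrays produced by FPGrowth. A plain-Python illustration of why order matters, and the sort-based fix:

```python
a = [1675, 4432, 1673]   # antecedent with the consequent appended at the end
b = [1673, 1675, 4432]   # the same itemset as it might be ordered in `cart`

# Direct equality fails even though both hold the same items:
print(a == b)                    # False

# Normalizing both sides (e.g. sorting) makes them comparable:
print(sorted(a) == sorted(b))    # True
```

In Spark 2.4+ the same normalization can be applied to both join keys with the built-in sort_array function before comparing them.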
Also be careful when using the concat function: if the consequent column contains null values, the result of the concatenation will also be null.
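A tiny plain-Python analogue of that caveat (mimicking Spark's null semantics, not Spark itself): if the consequent is missing, a naive concatenation propagates the null, so you may want to guard it:

```python
antecedent = [1686, 4068]
consequent = None  # e.g. a row with a null consequent

# Mirroring concat's behavior: any null input yields a null result.
enriched = None if consequent is None else antecedent + consequent
print(enriched)  # None

# Guarding with an empty-list fallback (akin to coalescing in SQL):
guarded = antecedent + (consequent or [])
print(guarded)   # [1686, 4068]
```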