Take ArrayType column from one PySpark dataframe and get corresponding value in another dataframe
I have two dataframes from running FPGrowth, one called itemsets and another called rules. They are formatted like so:
ITEMSETS DATAFRAME
+------------------------+-----+
|cart |freq |
+------------------------+-----+
|[7084781116] |10492|
|[7084781116, 2606500532]|362 |
|[7084781116, 0] |327 |
|[7084781116, 2001] |393 |
|[7084781116, 4011] |489 |
|[4460030705] |448 |
|[3800035800] |539 |
|[3022304060] |1188 |
|[2800021730] |901 |
|[1380018805] |437 |
+------------------------+-----+
RULES DATAFRAME
+--------------------+------------+
| antecedent| consequent|
+--------------------+------------+
| [1686, 4068]|[2640000010]|
|[1675, 4432, 3680...| [1673]|
|[1676, 1677, 3680...| [1678]|
|[1676, 1677, 3680...|[3680080816]|
|[1676, 1677, 3680...|[3680044476]|
|[1676, 1677, 3680...| [1675]|
|[7283040006, 7110...| [1683]|
|[7283040006, 7110...| [1682]|
|[1674, 4300000953...| [1673]|
|[1674, 4300000953...|[3680044476]|
+--------------------+------------+
I need to add a few new columns to perform a calculation (to reverse the antecedent and consequent relationship, if you're curious).
First, I need to add a new column to the rules dataframe that holds the corresponding frequencies from the itemsets dataframe. For example, where the consequent is [7084781116], the new column will hold the frequency of that array from the itemsets dataframe (10492, from the first row).
Next, I need to append the value of the consequent to the antecedent and do the same thing. For example, looking at the second row of the rules dataframe, I need to add 1673 to [1675, 4432, 3680...], then get the frequency of THAT array from the itemsets table and store it in another column.
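To make the desired result concrete, here is a plain-Python sketch of the two lookups on toy data (dictionaries standing in for the dataframes; the names and frequency values are illustrative, not from the actual data):

```python
# toy_itemsets maps a cart tuple to its frequency;
# toy_rules is a list of (antecedent, consequent) pairs.
toy_itemsets = {
    (7084781116,): 10492,
    (2606500532,): 901,
    (7084781116, 2606500532): 362,
}
toy_rules = [([7084781116], [2606500532])]

enriched = []
for antecedent, consequent in toy_rules:
    # 1) frequency of the consequent itemset
    freq = toy_itemsets.get(tuple(consequent))
    # 2) frequency of the antecedent with the consequent appended
    combined = tuple(antecedent + consequent)
    freq_enriched = toy_itemsets.get(combined)
    enriched.append((antecedent, consequent, freq, freq_enriched))

print(enriched)  # [([7084781116], [2606500532], 901, 362)]
```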
Can anyone help me out with this? I'm pretty new to PySpark and in over my head. I tried implementing several UDFs, for example something like the following, in hopes of converting the arrays to strings to make them easier to work with:
In [7]: def conv_to_str(listname):
   ...:     listname.sort()
   ...:     rv = ""
   ...:     for val in listname:
   ...:         rv += str(val) + "|"
   ...:     return rv[:-1]

In [8]: conv_to_str([1,3,2,6,5])
Out[8]: '1|2|3|5|6'
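For what it's worth, the same conversion can be written more compactly (a tidier version of the helper above, and it avoids mutating the input list in place):

```python
def conv_to_str(listname):
    # Sort a copy, stringify each value, and join with "|".
    return "|".join(str(val) for val in sorted(listname))

print(conv_to_str([1, 3, 2, 6, 5]))  # '1|2|3|5|6'
```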
Thanks!
I suggest these three steps:

1. To add the frequency column, use a left join on the consequent and cart columns.
2. To append the consequent value to the antecedent array, use the concat function (supported for arrays since Spark 2.4).
3. Perform another left join on the enriched antecedent column to get the frequency of this concatenated array.

So in PySpark the query could look like this:
from pyspark.sql.functions import col, concat

(
    rules_df
    .withColumn('antecedent_enriched', concat('antecedent', 'consequent'))
    .alias('a')
    .join(itemsets_df.alias('b'), col('a.consequent') == col('b.cart'), 'left')
    .join(itemsets_df.alias('c'), col('a.antecedent_enriched') == col('c.cart'), 'left')
    .select(
        'antecedent',
        'consequent',
        'b.freq',
        'antecedent_enriched',
        col('c.freq').alias('freq_enriched')
    )
)
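One thing to watch with this approach (an observation, not part of the original answer): joining on array equality only matches when the elements appear in the same order, so the concatenated antecedent must line up element-for-element with the cart arrays produced by FPGrowth. A plain-Python illustration of why order matters, and the sort-based fix:

```python
a = [1675, 4432, 1673]   # antecedent with the consequent appended at the end
b = [1673, 1675, 4432]   # the same itemset as it might be ordered in `cart`

# Direct equality fails even though both hold the same items:
print(a == b)                    # False

# Normalizing both sides (e.g. sorting) makes them comparable:
print(sorted(a) == sorted(b))    # True
```

In Spark 2.4+ the same normalization can be applied to both join keys with the built-in sort_array function before comparing them.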
Also be careful when using the concat function: if the consequent column contains null values, the result of the concatenation will also be null.
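A tiny plain-Python analogue of that caveat (mimicking Spark's null semantics, not Spark itself): if the consequent is missing, a naive concatenation propagates the null, so you may want to guard it:

```python
antecedent = [1686, 4068]
consequent = None  # e.g. a row with a null consequent

# Mirroring concat's behavior: any null input yields a null result.
enriched = None if consequent is None else antecedent + consequent
print(enriched)  # None

# Guarding with an empty-list fallback (akin to coalescing in SQL):
guarded = antecedent + (consequent or [])
print(guarded)   # [1686, 4068]
```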