
Take ArrayType column from one PySpark dataframe and get corresponding value in another dataframe

I have two dataframes, one called itemsets and another called rules, from running FPGrowth. They are formatted like so:

ITEMSETS DATAFRAME
+------------------------+-----+
|cart                    |freq |
+------------------------+-----+
|[7084781116]            |10492|
|[7084781116, 2606500532]|362  |
|[7084781116, 0]         |327  |
|[7084781116, 2001]      |393  |
|[7084781116, 4011]      |489  |
|[4460030705]            |448  |
|[3800035800]            |539  |
|[3022304060]            |1188 |
|[2800021730]            |901  |
|[1380018805]            |437  |
+------------------------+-----+

RULES DATAFRAME
+--------------------+------------+
|          antecedent|  consequent|
+--------------------+------------+
|        [1686, 4068]|[2640000010]|
|[1675, 4432, 3680...|      [1673]|
|[1676, 1677, 3680...|      [1678]|
|[1676, 1677, 3680...|[3680080816]|
|[1676, 1677, 3680...|[3680044476]|
|[1676, 1677, 3680...|      [1675]|
|[7283040006, 7110...|      [1683]|
|[7283040006, 7110...|      [1682]|
|[1674, 4300000953...|      [1673]|
|[1674, 4300000953...|[3680044476]|
+--------------------+------------+

I need to add a few new columns to perform a calculation (to reverse the antecedent and consequent relationship, if you're curious).

First, I need to add a new column to the rules dataframe that has the corresponding frequencies from the itemsets dataframe. So for example, where the consequent is [7084781116], the new column will have the frequency of that array from the itemsets dataframe (10492, via the first row).

Next, I need to append the value of the consequent to the antecedent, and then do the same thing. So for example, looking at the second row in the rules dataframe, I need to add 1673 to [1675, 4432, 3680...], then get the frequency of THAT array from the itemsets table, and store it in another column.

Can anyone help me out with this? I'm pretty new to PySpark and in over my head. I tried implementing several UDFs, for example something like the following, in the hope of converting the arrays to strings to possibly make them easier to work with:

In [7]: def conv_to_str(listname):
            listname.sort()
            rv = ""
            for val in listname:
                rv += str(val) + "|"
            return rv[:-1]

In [8]: conv_to_str([1,3,2,6,5])
Out[8]: '1|2|3|5|6'
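
For reference, a minimal sketch of how such a helper could be wired into PySpark as a UDF and applied to an ArrayType column (the conv_to_str_udf and cart_key names are illustrative, not from the original post):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Sketch: wrap the helper above as a UDF that returns a string key.
conv_to_str_udf = udf(conv_to_str, StringType())

# Apply it to the ArrayType `cart` column so each basket gets a comparable
# string key such as '1|2|3|5|6'.
itemsets_keyed = itemsets_df.withColumn('cart_key', conv_to_str_udf('cart'))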

Thanks!

I suggest these three steps:

  1. For adding the frequency column, use a left join on the consequent and cart columns.
  2. For adding the consequent value to the antecedent array, use the concat function (supported for arrays since Spark 2.4).
  3. Do the left join again on the enriched antecedent column to get the frequency of this concatenated array.

So in PySpark the query could look like this:

from pyspark.sql.functions import col, concat

(
  rules_df
  # Append the (single-element) consequent array to the antecedent array.
  .withColumn('antecedent_enriched', concat('antecedent', 'consequent'))
  .alias('a')
  # Left join to pick up the frequency of the consequent itemset.
  .join(itemsets_df.alias('b'), col('a.consequent') == col('b.cart'), 'left')
  # Left join again to pick up the frequency of the enriched antecedent.
  .join(itemsets_df.alias('c'), col('a.antecedent_enriched') == col('c.cart'), 'left')
  .select(
    'antecedent',
    'consequent',
    'b.freq',
    'antecedent_enriched',
    col('c.freq').alias('freq_enriched')
  )
)

Also be careful when using the concat function: if the consequent column contains null values, the result of the concatenation will also be null.
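
As a sketch of one way to guard against that (the when/isNotNull fallback here is an assumption about how nulls should be handled, not part of the original answer):

from pyspark.sql.functions import col, concat, when

# Sketch: only concatenate when `consequent` is present; otherwise fall back
# to the unmodified antecedent so the enriched column never becomes null.
rules_guarded = rules_df.withColumn(
    'antecedent_enriched',
    when(col('consequent').isNotNull(),
         concat(col('antecedent'), col('consequent')))
    .otherwise(col('antecedent'))
)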
