Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks

My question is related to my previous one at How to efficiently join large pyspark dataframes and small python list for some NLP results on databricks.

I have worked out part of it and am now stuck on another problem.

I have a small pyspark dataframe like:

  df1: 

   +-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
   |topic|                                       termIndices|                                       termWeights|                                             terms|
   +-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
   |    0|      [3, 155, 108, 67, 239, 4, 72, 326, 128, 189]|[0.023463344607734377, 0.011772322769900843, 0....|[cell, apoptosis, uptake, loss, transcription, ...|
   |    1|      [16, 8, 161, 86, 368, 153, 18, 214, 21, 222]|[0.013057307487199429, 0.011453455929929763, 0....|[therapy, cancer, diet, lung, marker, sensitivi...|
   |    2|            [0, 1, 124, 29, 7, 2, 84, 299, 22, 90]|[0.03979063871841061, 0.026593954837078836, 0.0...|[group, expression, performance, use, disease, ...|
   |    3|   [204, 146, 74, 240, 152, 384, 55, 250, 238, 92]|[0.009305626056223443, 0.008840730657888991, 0....|[pattern, chemotherapy, mass, the amount, targe...|

It has fewer than 100 rows and is very small. Each term has a termWeight value in the "termWeights" column.

I have another large pyspark dataframe (50+ GB) like:

  df2: 
  +------+--------------------------------------------------+
  |  r_id|                                            tokens|
  +------+--------------------------------------------------+
  |     0|[The human KCNJ9, Kir, GIRK3, member, potassium...|
  |     1|[BACKGROUND, the treatment, breast, cancer, the...|
  |     2|[OBJECTIVE, the relationship, preoperative atri...|

For each row in df2, I need to find the best matching terms in df1, i.e., the topic whose matched terms have the highest sum of termWeights among all topics.

Finally, I need a df like:

 r_id tokens topic (the topic in df1 that has the highest sum of termWeights among all topics)
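
To make the rule concrete, this is the per-row logic I am after, sketched in plain Python with made-up topics and weights:

# made-up topics/weights, only to illustrate the intended per-row logic
topics = {
    0: {"cell": 0.0234, "apoptosis": 0.0118},
    1: {"therapy": 0.0131, "cancer": 0.0115},
}
tokens = ["background", "the treatment", "cancer", "therapy"]   # one df2 row

# per topic, sum the weights of the terms that appear in tokens,
# then keep the topic with the highest sum
scores = {topic: sum(w for term, w in weights.items() if term in tokens)
          for topic, weights in topics.items()}
best_topic = max(scores, key=scores.get)   # -> 1 for this example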

I have defined a UDF (based on df2), but it cannot access the columns of df1. I have thought about using a cross join between df1 and df2, but I do not need to join each row of df2 with each row of df1. I only need to keep all columns of df2 and add one "topic" column: the df1 topic with the highest sum of termWeights, based on matching each df1 topic's terms against each df2 row's tokens.

I am not sure how to implement this logic with pyspark.sql.functions.udf.

IIUC, you can try something like the following (I split the processing flow into 4 steps, Spark 2.4+ is required):

Step-1: convert all df2.tokens to lowercase so we can do text comparison:

from pyspark.sql.functions import expr, desc, row_number, broadcast

df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))
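
Step-1 only lowercases df2.tokens; if the terms in df1 may contain uppercase characters, you could normalize them the same way (optional, depending on your data):

# optional: lowercase df1.terms as well so the comparison is case-insensitive
df1 = df1.withColumn('terms', expr("transform(terms, x -> lower(x))"))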

Step-2: left-join df2 with df1 using arrays_overlap:

df3 = df2.join(broadcast(df1), expr("arrays_overlap(terms, tokens)"), "left")

Step-3: use the aggregate function to calculate matched_sum_of_weights from terms, termWeights and tokens:

df4 = df3.selectExpr(
    "r_id",
    "tokens",
    "topic",
    """
      aggregate(
        /* find all terms+termWeights which are shown in tokens array */
        filter(arrays_zip(terms,termWeights), x -> array_contains(tokens, x.terms)),
        0D,
        /* get the sum of all termWeights from the matched terms */
        (acc, y) -> acc + y.termWeights
      ) as matched_sum_of_weights
    """)

Step-4: for each r_id, find the row with the highest matched_sum_of_weights using a Window function, and only keep the rows having row_number == 1:

from pyspark.sql import Window
w1 = Window.partitionBy('r_id').orderBy(desc('matched_sum_of_weights'))

df_new = df4.withColumn('rn', row_number().over(w1)).filter('rn=1').drop('rn', 'matched_sum_of_weights')
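
If two topics can tie on matched_sum_of_weights for the same r_id, row_number will pick one of them arbitrarily; adding a secondary sort key makes the choice deterministic, for example:

# break ties by topic so the kept row is deterministic
w1 = Window.partitionBy('r_id').orderBy(desc('matched_sum_of_weights'), 'topic')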

Alternative: if the size of df1 is not very large, this might be handled without the join/window partition etc. The code below only outlines the idea, which you should improve based on your actual data:

from pyspark.sql.functions import expr, when, coalesce, array_contains, lit, struct

# create a dict from df1 with topic as key and list of termWeights+terms as value
d = df1.selectExpr("string(topic)", "arrays_zip(termWeights,terms) as terms").rdd.collectAsMap()

# skip this if the text comparison is case-sensitive; you might do the same to df1 as well
df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))

# save the column names of the original df2
cols = df2.columns

# iterate through all items of d (or df1) and add one new column to df2 per topic,
# each holding a struct with `sum_of_weights`, `topic` and `has_match` (whether any term matched)
for x,y in d.items():
  df2 = df2.withColumn(x,
      struct(
        sum([when(array_contains('tokens', t.terms), t.termWeights).otherwise(0) for t in y]).alias('sum_of_weights'),
        lit(x).alias('topic'),
        coalesce(*[when(array_contains('tokens', t.terms),1) for t in y]).isNotNull().alias('has_match')
      )
  )

# create a new array containing all new columns (topics), and find array_max
# from items with `has_match == true`, and then retrieve the `topic` field
df_new = df2.selectExpr(
    *cols,
    f"array_max(filter(array({','.join(map('`{}`'.format,d.keys()))}), x -> x.has_match)).topic as topic"
)
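
To illustrate, the alternative can be exercised end-to-end on a tiny made-up dataset (all values below are hypothetical and only mirror the shape of the real data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical toy frames that mirror the shapes of df1 and df2
df1 = spark.createDataFrame(
    [(0, [0.023, 0.011], ["cell", "apoptosis"]),
     (1, [0.013, 0.011], ["therapy", "cancer"])],
    ["topic", "termWeights", "terms"])

df2 = spark.createDataFrame(
    [(0, ["The human KCNJ9", "cell", "apoptosis"]),
     (1, ["BACKGROUND", "the treatment", "cancer"])],
    ["r_id", "tokens"])

# after running the alternative code above on these two frames:
df_new.show(truncate=False)
# expected: r_id=0 -> topic 0, r_id=1 -> topic 1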
