Perform a user defined function on a column of a large pyspark dataframe based on some columns of another pyspark dataframe on databricks
My question is related to my previous one: How to efficiently join a large pyspark dataframe and a small python list for some NLP results on databricks.
I have solved part of it and am now stuck on a further problem.
I have a small pyspark dataframe like:
df1:
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
|topic| termIndices| termWeights| terms|
+-----+--------------------------------------------------+--------------------------------------------------+--------------------------------------------------+
| 0| [3, 155, 108, 67, 239, 4, 72, 326, 128, 189]|[0.023463344607734377, 0.011772322769900843, 0....|[cell, apoptosis, uptake, loss, transcription, ...|
| 1| [16, 8, 161, 86, 368, 153, 18, 214, 21, 222]|[0.013057307487199429, 0.011453455929929763, 0....|[therapy, cancer, diet, lung, marker, sensitivi...|
| 2| [0, 1, 124, 29, 7, 2, 84, 299, 22, 90]|[0.03979063871841061, 0.026593954837078836, 0.0...|[group, expression, performance, use, disease, ...|
| 3| [204, 146, 74, 240, 152, 384, 55, 250, 238, 92]|[0.009305626056223443, 0.008840730657888991, 0....|[pattern, chemotherapy, mass, the amount, targe...|
It has fewer than 100 rows and is very small. Each term has a termWeight value in the "termWeights" column.
I have another, large pyspark dataframe (50+ GB) like:
df2:
+------+--------------------------------------------------+
|r_id| tokens|
+------+--------------------------------------------------+
| 0|[The human KCNJ9, Kir, GIRK3, member, potassium...|
| 1|[BACKGROUND, the treatment, breast, cancer, the...|
| 2|[OBJECTIVE, the relationship, preoperative atri...|
For each row in df2, I need to find the best-matching topic in df1, i.e. the topic with the highest sum of termWeights over all topics.
In the end, I need a df like:
r_id tokens topic (the topic in df1 that has the highest sum of termWeights among all topics)
I have defined a UDF (based on df2), but it cannot access the columns of df1. I have considered using a cross join between df1 and df2, but I do not need to join every row of df2 with every row of df1. I just need to keep all the columns of df2 and add one "topic" column: the topic with the highest sum of termWeights, based on the matches between each df1 topic's terms and each df2 row's tokens.
I am not sure how to implement this logic with pyspark.sql.functions.udf.
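To make the matching rule concrete, here is a tiny pure-Python sketch of the intended per-row logic, using made-up tokens and weights (not the real columns):

```python
# Toy illustration: for one df2 row, sum the weights of the terms that
# appear in its tokens, per topic, then pick the topic with the highest sum.
topics = {
    0: {"cell": 0.023, "apoptosis": 0.012},
    1: {"therapy": 0.013, "cancer": 0.011},
}
tokens = ["the", "cancer", "therapy", "cell"]

scores = {
    topic: sum(w for term, w in weights.items() if term in tokens)
    for topic, weights in topics.items()
}
best_topic = max(scores, key=scores.get)
# topic 1 scores 0.013 + 0.011 = 0.024, beating topic 0's 0.023
```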
IIUC, you can try something like the following (I split the processing into four steps; Spark 2.4+ is required):
Step 1: convert all df2.tokens to lowercase so we can compare text:
from pyspark.sql.functions import expr, desc, row_number, broadcast
df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))
Step 2: left-join df2 with df1 using arrays_overlap:
df3 = df2.join(broadcast(df1), expr("arrays_overlap(terms, tokens)"), "left")
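For the non-null case, arrays_overlap(terms, tokens) is true when the two arrays share at least one element, so the join only pairs a df2 row with the topics that match it at all. A plain-Python sketch of that check (Spark's version additionally returns null in some null-element cases, which is ignored here):

```python
def arrays_overlap(a, b):
    # True if the two lists share at least one non-None element,
    # mirroring Spark SQL's arrays_overlap for the non-null case
    return bool({x for x in a if x is not None} & {x for x in b if x is not None})

print(arrays_overlap(["cell", "loss"], ["the", "cell"]))  # True
print(arrays_overlap(["cancer"], ["the", "cell"]))        # False
```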
Step 3: use the aggregate function to calculate matched_sum_of_weights from terms, termWeights and tokens:
df4 = df3.selectExpr(
    "r_id",
    "tokens",
    "topic",
    """
    aggregate(
        /* find all terms+termWeights which are shown in tokens array */
        filter(arrays_zip(terms, termWeights), x -> array_contains(tokens, x.terms)),
        0D,
        /* get the sum of all termWeights from the matched terms */
        (acc, y) -> acc + y.termWeights
    ) as matched_sum_of_weights
    """)
Step 4: for each r_id, use a Window function to find the row with the highest matched_sum_of_weights, and keep only the rows where row_number == 1:
from pyspark.sql import Window
w1 = Window.partitionBy('r_id').orderBy(desc('matched_sum_of_weights'))
df_new = df4.withColumn('rn', row_number().over(w1)).filter('rn=1').drop('rn', 'matched_sum_of_weights')
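What the window + row_number filter does, expressed in plain Python on toy rows (row_number keeps exactly one row per r_id even on ties; this sketch keeps the first-seen best):

```python
# Keep, for each r_id, the row with the highest matched_sum_of_weights
rows = [
    {"r_id": 0, "topic": 0, "matched_sum_of_weights": 0.032},
    {"r_id": 0, "topic": 2, "matched_sum_of_weights": 0.051},
    {"r_id": 1, "topic": 1, "matched_sum_of_weights": 0.013},
]

best = {}
for row in rows:
    cur = best.get(row["r_id"])
    if cur is None or row["matched_sum_of_weights"] > cur["matched_sum_of_weights"]:
        best[row["r_id"]] = row
# r_id 0 keeps topic 2, r_id 1 keeps topic 1
```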
Alternative: if df1 is not very large, this could be handled without the join/window-partition steps. The code below only outlines the idea; you should refine it based on your actual data:
from pyspark.sql.functions import expr, when, coalesce, array_contains, lit, struct
# create a dict from df1 with topic as key and list of termWeights+terms as value
d = df1.selectExpr("string(topic)", "arrays_zip(termWeights,terms) as terms").rdd.collectAsMap()
# skip this step if text comparison is case-sensitive; you might do the same to df1 as well
df2 = df2.withColumn('tokens', expr("transform(tokens, x -> lower(x))"))
# save the column names of the original df2
cols = df2.columns
# iterate through all items of d(or df1) and update df2 with new columns from each
# topic with the value a struct containing `sum_of_weights`, `topic` and `has_match`(if any terms is matched)
for x, y in d.items():
    df2 = df2.withColumn(x,
        struct(
            sum([when(array_contains('tokens', t.terms), t.termWeights).otherwise(0) for t in y]).alias('sum_of_weights'),
            lit(x).alias('topic'),
            coalesce(*[when(array_contains('tokens', t.terms), 1) for t in y]).isNotNull().alias('has_match')
        )
    )
# create a new array containing all new columns(topics), and find array_max
# from items with `has_match == true`, and then retrieve the `topic` field
df_new = df2.selectExpr(
    *cols,
    f"array_max(filter(array({','.join(map('`{}`'.format, d.keys()))}), x -> x.has_match)).topic as topic"
)
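The trick in the last line relies on Spark comparing structs field by field, so array_max over (sum_of_weights, topic, has_match) structs picks the struct with the highest sum_of_weights. A plain-Python sketch of the same per-row logic, with a made-up dict d:

```python
# Toy version of the alternative's per-row computation
d = {
    "0": [{"terms": "cell", "termWeights": 0.023}, {"terms": "loss", "termWeights": 0.011}],
    "1": [{"terms": "cancer", "termWeights": 0.013}],
}
tokens = ["the", "cell", "loss"]

structs = []
for topic, pairs in d.items():
    sum_of_weights = sum(p["termWeights"] for p in pairs if p["terms"] in tokens)
    has_match = any(p["terms"] in tokens for p in pairs)
    structs.append((sum_of_weights, topic, has_match))

matched = [s for s in structs if s[2]]        # filter(..., x -> x.has_match)
# tuples, like Spark structs, compare element by element, so max picks the
# highest sum_of_weights first
topic = max(matched)[1] if matched else None  # array_max(...).topic
# topic == "0": 0.023 + 0.011 matched, while topic "1" has no match at all
```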