
Join two spark dataframes by evaluating an expression

I have two Spark dataframes.

userMemberShipDF:

user  membership_array
a1    s1, s2, s3
a2    s4, s6
a3    s5, s4, s3
a4    s1, s3, s4, s5
a5    s2, s4, s6
a6    s3, s7, s1
a7    s1, s4, s6

and categoryDF:

category_id  membership_expression  start_date  duration
c1           s1 || s2               2022-05-01  30
c2           s4 && s6 && !s2        2022-06-20  50
c3           s3 && s4               2022-06-10  60

The resulting dataframe should contain the columns user, category_id, start_date and duration.

I already have a function that takes the membership_expression from the second dataframe along with the membership_array from the first dataframe and evaluates it to true or false.

For example, membership_expression = s1 || s2 would match users a1, a4, a5, a6 and a7, while the expression s4 && s6 && !s2 would only match a2, etc.
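
For illustration only, here is a minimal Python sketch of the kind of evaluator described above. The question's actual CategoryEvaluator.evaluateMemberShipExpression is not shown, so the function name, parsing and operator handling below are assumptions; it assumes expressions only combine membership ids with ||, && and !.

def evaluate_membership_expression(expression, membership_array):
    # Hypothetical evaluator: substitute each membership id with True/False
    # depending on whether the user holds it, then evaluate the boolean logic.
    memberships = set(membership_array)
    py_expr = expression.replace("||", " or ").replace("&&", " and ").replace("!", " not ")
    tokens = [
        tok if tok in ("or", "and", "not") else str(tok in memberships)
        for tok in py_expr.split()
    ]
    # eval is safe here only because the token stream is restricted to True/False/or/and/not
    return eval(" ".join(tokens))

print(evaluate_membership_expression("s1 || s2", ["s1", "s2", "s3"]))    # True
print(evaluate_membership_expression("s4 && s6 && !s2", ["s4", "s6"]))   # True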

I want to join the two dataframes based on whether this expression evaluates to true or false. I looked at the Spark join API, but it only takes a column as the join condition, not an arbitrary boolean expression.

So I have tried the approach below:

val matchedUserSegments = userMemberShipDF
  .map { r =>
    // categoryDF is broadcasted
    // collect the ids of every category whose expression this user's memberships satisfy
    val category_items_set = categoryDF.value.flatMap { fl =>
      if (CategoryEvaluator.evaluateMemberShipExpression(fl.membership_expression, r.membership_array)) {
        Some(fl.category_id)
      } else {
        None
      }
    }
    (r.user_id, category_items_set)
  }
  .toDF("user_id", "category_items_set")

and then exploded the resulting dataframe on category_items_set and joined it with categoryDF to obtain the desired output table.

I understand I am doing the operations twice, but I could not find a better way of computing everything in a single pass over both dataframes.

Please suggest an efficient way of doing this.

I have a lot of data, and the Spark job is taking more than 24 hours to complete. Thanks.

PS: To keep things simple, I've not included start_date and duration, and have also limited the sample user rows to a1, a2, a3, a4. The output shown here may not exactly match your expected output; but if you use the full data, I'm sure the output will match.

import pyspark.sql.functions as F

userMemberShipDF = spark.createDataFrame([
    ("a1",["s1","s2","s3"]),
    ("a2",["s4","s6"]),
    ("a3",["s5","s4","s3"]),
    ("a4",["s1","s3","s4","s5"]),
], ["user","membership_array"])

Convert each membership s1, s2, s3, etc. into an individual column and mark it as true if the user has that membership:

userMemberShipDF = userMemberShipDF.withColumn("membership_individual", F.explode("membership_array"))
+----+----------------+---------------------+
|user|membership_array|membership_individual|
+----+----------------+---------------------+
|  a1|    [s1, s2, s3]|                   s1|
|  a1|    [s1, s2, s3]|                   s2|
|  a1|    [s1, s2, s3]|                   s3|
|  a2|        [s4, s6]|                   s4|
|  a2|        [s4, s6]|                   s6|
|  a3|    [s5, s4, s3]|                   s5|
|  a3|    [s5, s4, s3]|                   s4|
|  a3|    [s5, s4, s3]|                   s3|
|  a4|[s1, s3, s4, s5]|                   s1|
|  a4|[s1, s3, s4, s5]|                   s3|
|  a4|[s1, s3, s4, s5]|                   s4|
|  a4|[s1, s3, s4, s5]|                   s5|
+----+----------------+---------------------+


userMemberShipDF = userMemberShipDF.groupBy("user").pivot("membership_individual").agg(F.count("*").isNotNull()).na.fill(False)
+----+-----+-----+-----+-----+-----+-----+
|user|   s1|   s2|   s3|   s4|   s5|   s6|
+----+-----+-----+-----+-----+-----+-----+
|  a3|false|false| true| true| true|false|
|  a4| true|false| true| true| true|false|
|  a2|false|false|false| true|false| true|
|  a1| true| true| true|false|false|false|
+----+-----+-----+-----+-----+-----+-----+

In the category data, replace ||, && and ! with or, and and not:

categoryDF = spark.createDataFrame([
    ("c1", "s1 || s2"),
    ("c2", "s4 && s6 && !s2"),
    ("c3", "s3 && s4"),
], ["category_id", "membership_expression"])

categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"\|\|", " or "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"\&\&", " and "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"\!", " not "))

+-----------+-------------------------+
|category_id|membership_expression    |
+-----------+-------------------------+
|c1         |s1  or  s2               |
|c2         |s4  and  s6  and   not s2|
|c3         |s3  and  s4              |
+-----------+-------------------------+

Cross join the user and category data to evaluate each user against each category:

resultDF_sp = categoryDF.crossJoin(userMemberShipDF)
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+
|category_id|membership_expression    |user|s1   |s2   |s3   |s4   |s5   |s6   |
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+
|c1         |s1  or  s2               |a3  |false|false|true |true |true |false|
|c1         |s1  or  s2               |a4  |true |false|true |true |true |false|
|c1         |s1  or  s2               |a2  |false|false|false|true |false|true |
|c1         |s1  or  s2               |a1  |true |true |true |false|false|false|
|c2         |s4  and  s6  and   not s2|a3  |false|false|true |true |true |false|
|c2         |s4  and  s6  and   not s2|a4  |true |false|true |true |true |false|
|c2         |s4  and  s6  and   not s2|a2  |false|false|false|true |false|true |
|c2         |s4  and  s6  and   not s2|a1  |true |true |true |false|false|false|
|c3         |s3  and  s4              |a3  |false|false|true |true |true |false|
|c3         |s3  and  s4              |a4  |true |false|true |true |true |false|
|c3         |s3  and  s4              |a2  |false|false|false|true |false|true |
|c3         |s3  and  s4              |a1  |true |true |true |false|false|false|
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+

Evaluate membership_expression:

Ahhh! This part is not elegant.

Spark provides the expr function to evaluate SQL expressions using column values; but this works only if the expression is a static string:

resultDF_sp.select(F.expr("s1 or s2"))

But if the expression "s1 or s2" is a column value (like the membership_expression column above), then there is no direct way to evaluate it. This results in the error Column is not iterable:

resultDF_sp.select(F.expr(F.col("membership_expression")))

There are several questions on Stack Overflow about this, but all of them suggest parsing the expression and writing an evaluator to manually evaluate the parsed expression.

Fortunately, it is possible to evaluate an expression held as a column value using the values of the other columns as parameters.

So, the part I don't like, but have no choice about, is to convert the dataframe to pandas, evaluate the expression there and convert back to Spark (if someone can suggest how to achieve this in Spark, I'll be happy to include it in an edit):

resultDF_pd = resultDF_sp.toPandas()

def evaluate_expr(row_series):
    # Turn the row into a one-row dataframe so that the boolean membership
    # columns (s1, s2, ...) can be referenced by name inside eval
    df = row_series.to_frame().transpose().infer_objects()
    # Evaluate this row's membership_expression against its own column values
    return df.eval(df["membership_expression"].values[0]).values[0]

resultDF_pd["is_matching_user"] = resultDF_pd.apply(lambda row: evaluate_expr(row), axis=1)
resultDF_sp = spark.createDataFrame(resultDF_pd[["category_id", "user", "is_matching_user"]])
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
|         c1|  a3|           false|
|         c1|  a4|            true|
|         c1|  a2|           false|
|         c1|  a1|            true|
|         c2|  a3|           false|
|         c2|  a4|           false|
|         c2|  a2|            true|
|         c2|  a1|           false|
|         c3|  a3|            true|
|         c3|  a4|            true|
|         c3|  a2|           false|
|         c3|  a1|           false|
+-----------+----+----------------+

Finally, filter the matching users:

resultDF_sp = resultDF_sp.filter("is_matching_user")
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
|         c1|  a4|            true|
|         c1|  a1|            true|
|         c2|  a2|            true|
|         c3|  a3|            true|
|         c3|  a4|            true|
+-----------+----+----------------+
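
To also carry start_date and duration into the final result (they were left out of the sample above for simplicity), the matched (category_id, user) pairs can be joined back to the category data. A sketch, assuming a categoryFullDF built from the category table in the question (the name categoryFullDF is assumed here, and resultDF_sp is the filtered dataframe from the previous step):

# Hypothetical: category data with the start_date and duration columns from the question
categoryFullDF = spark.createDataFrame([
    ("c1", "2022-05-01", 30),
    ("c2", "2022-06-20", 50),
    ("c3", "2022-06-10", 60),
], ["category_id", "start_date", "duration"])

finalDF = (resultDF_sp
           .join(categoryFullDF, on="category_id", how="inner")
           .select("user", "category_id", "start_date", "duration"))
finalDF.show()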
