
Join two spark dataframes by evaluating an expression

I have two Spark dataframes:

userMemberShipDF:

user  membership_array
a1    s1, s2, s3
a2    s4, s6
a3    s5, s4, s3
a4    s1, s3, s4, s5
a5    s2, s4, s6
a6    s3, s7, s1
a7    s1, s4, s6

and categoryDF:

category_id  membership_expression  start_date  duration
c1           s1 || s2               2022-05-01  30
c2           s4 && s6 && !s2        2022-06-20  50
c3           s3 && s4               2022-06-10  60

The resultant dataframe should contain the columns user, category_id, start_date and duration.

I already have a function written which takes the membership_expression from the second dataframe along with the membership_array from the first dataframe and evaluates it to true or false.

For example, membership_expression = s1 || s2 would match users a1, a4, a5, a6 and a7, while the expression s4 && s6 && !s2 would only match a2, etc.
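
(For illustration only, here is a minimal Python sketch of what such an evaluator could do; the actual CategoryEvaluator is in Scala and not shown, so the name and the rewrite to Python boolean operators below are assumptions:)

import re

def evaluate_membership_expression(expression, membership_array):
    # Hypothetical evaluator: rewrite ||/&&/! to Python operators, then
    # substitute each segment token with True/False based on membership
    members = set(membership_array)
    py_expr = (expression.replace("||", " or ")
                         .replace("&&", " and ")
                         .replace("!", " not "))
    py_expr = re.sub(r"\bs\d+\b", lambda m: str(m.group(0) in members), py_expr)
    return eval(py_expr)  # acceptable here: expressions are machine-generated, not user input

evaluate_membership_expression("s1 || s2", ["s1", "s2", "s3"])   # True
evaluate_membership_expression("s4 && s6 && !s2", ["s4", "s6"])  # True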

I wanted to join both dataframes based on whether this expression evaluates to true or false. I looked at Spark's join, and it only takes a column expression as the join condition, not an arbitrary boolean function.
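
(One possible way around this, sketched as an untested assumption rather than a definitive solution: a join condition may be any boolean Column, so a boolean UDF wrapping the evaluator can serve as the condition. Spark would still evaluate it against every user/category pair, i.e. effectively a cross join, and older Spark versions may require enabling spark.sql.crossJoin.enabled:)

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Assumes the Python evaluator sketched above is available
evaluate_udf = F.udf(evaluate_membership_expression, BooleanType())

joined = userMemberShipDF.join(
    categoryDF,
    evaluate_udf(categoryDF["membership_expression"],
                 userMemberShipDF["membership_array"]))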

So I have tried the approach below:

// categoryDF is a broadcast variable holding the category rows
val matchedUserSegments = userMemberShipDF
  .map { r =>
    val category_items_set = categoryDF.value.flatMap { fl =>
      if (CategoryEvaluator.evaluateMemberShipExpression(fl.membership_expression, r.membership_array))
        Some(fl.category_id)
      else
        None
    }
    (r.user_id, category_items_set)
  }
  .toDF("user_id", "category_items_set")

I then exploded the resultant dataframe on category_items_set and joined it with categoryDF to obtain the desired output table, as sketched below.
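
(In PySpark terms, that follow-up step would look roughly like this; matchedUserSegments and categoryDF are assumed to have the shapes described above:)

matched = matchedUserSegments.withColumn("category_id", F.explode("category_items_set"))
result = matched.join(categoryDF, "category_id") \
                .select("user_id", "category_id", "start_date", "duration")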

I understand I am doing the operations twice, but I could not find a better way of calculating everything by iterating through both dataframes just once.

Please suggest an efficient way of doing this.

I have a lot of data, and the Spark job is taking more than 24 hours to finish. Thanks!

PS: To keep things simple, I've not included start_date and duration, and I've also limited the sample user rows to a1, a2, a3, a4. The output shown here may not exactly match your expected output; but if you use the full data, I'm sure the output will match.

import pyspark.sql.functions as F

userMemberShipDF = spark.createDataFrame([
    ("a1",["s1","s2","s3"]),
    ("a2",["s4","s6"]),
    ("a3",["s5","s4","s3"]),
    ("a4",["s1","s3","s4","s5"]),
], ["user","membership_array"])

Convert each membership s1, s2, s3, etc. into an individual column and mark it as true if the user has that membership:

userMemberShipDF = userMemberShipDF.withColumn("membership_individual", F.explode("membership_array"))
+----+----------------+---------------------+
|user|membership_array|membership_individual|
+----+----------------+---------------------+
|  a1|    [s1, s2, s3]|                   s1|
|  a1|    [s1, s2, s3]|                   s2|
|  a1|    [s1, s2, s3]|                   s3|
|  a2|        [s4, s6]|                   s4|
|  a2|        [s4, s6]|                   s6|
|  a3|    [s5, s4, s3]|                   s5|
|  a3|    [s5, s4, s3]|                   s4|
|  a3|    [s5, s4, s3]|                   s3|
|  a4|[s1, s3, s4, s5]|                   s1|
|  a4|[s1, s3, s4, s5]|                   s3|
|  a4|[s1, s3, s4, s5]|                   s4|
|  a4|[s1, s3, s4, s5]|                   s5|
+----+----------------+---------------------+


# Pivot each membership value into its own boolean column; pivot cells with
# no matching rows come back as null and are filled with False
userMemberShipDF = userMemberShipDF.groupBy("user").pivot("membership_individual").agg(F.count("*").isNotNull()).na.fill(False)
+----+-----+-----+-----+-----+-----+-----+
|user|   s1|   s2|   s3|   s4|   s5|   s6|
+----+-----+-----+-----+-----+-----+-----+
|  a3|false|false| true| true| true|false|
|  a4| true|false| true| true| true|false|
|  a2|false|false|false| true|false| true|
|  a1| true| true| true|false|false|false|
+----+-----+-----+-----+-----+-----+-----+

In the category data, replace the operators ||, &&, ! with or, and, not:

categoryDF = spark.createDataFrame([
    ("c1", "s1 || s2"),
    ("c2", "s4 && s6 && !s2"),
    ("c3", "s3 && s4"),
], ["category_id", "membership_expression"])

categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"\|\|", " or "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"&&", " and "))
categoryDF = categoryDF.withColumn("membership_expression", F.regexp_replace("membership_expression", r"!", " not "))

+-----------+-------------------------+
|category_id|membership_expression    |
+-----------+-------------------------+
|c1         |s1  or  s2               |
|c2         |s4  and  s6  and   not s2|
|c3         |s3  and  s4              |
+-----------+-------------------------+

Cross join user and category data to evaluate each user against each category:

resultDF_sp = categoryDF.crossJoin(userMemberShipDF)
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+
|category_id|membership_expression    |user|s1   |s2   |s3   |s4   |s5   |s6   |
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+
|c1         |s1  or  s2               |a3  |false|false|true |true |true |false|
|c1         |s1  or  s2               |a4  |true |false|true |true |true |false|
|c1         |s1  or  s2               |a2  |false|false|false|true |false|true |
|c1         |s1  or  s2               |a1  |true |true |true |false|false|false|
|c2         |s4  and  s6  and   not s2|a3  |false|false|true |true |true |false|
|c2         |s4  and  s6  and   not s2|a4  |true |false|true |true |true |false|
|c2         |s4  and  s6  and   not s2|a2  |false|false|false|true |false|true |
|c2         |s4  and  s6  and   not s2|a1  |true |true |true |false|false|false|
|c3         |s3  and  s4              |a3  |false|false|true |true |true |false|
|c3         |s3  and  s4              |a4  |true |false|true |true |true |false|
|c3         |s3  and  s4              |a2  |false|false|false|true |false|true |
|c3         |s3  and  s4              |a1  |true |true |true |false|false|false|
+-----------+-------------------------+----+-----+-----+-----+-----+-----+-----+

Evaluate membership_expression

Ahhh! This part is not elegant.

Spark provides the expr function to evaluate SQL expressions against column values, but this works only if the expression is a static string:

resultDF_sp.select(F.expr("s1 or s2"))

But if the expression "s1 or s2" is itself a column value (like the membership_expression column above), there is no way to evaluate it directly. The following results in the error Column is not iterable:

resultDF_sp.select(F.expr(F.col("membership_expression")))

There are several questions on Stack Overflow about this, but all of them suggest parsing the expression and writing an evaluator to manually evaluate the parsed expression.

Fortunately, pandas can evaluate an expression held as a column value against the values of the other columns.

So, the part I don't like but have no choice about is to convert the dataframe to pandas, evaluate the expression, and convert back to Spark (if someone can suggest how to achieve this in Spark, I'll be happy to include it in an edit):

resultDF_pd = resultDF_sp.toPandas()

def evaluate_expr(row_series):
    # Turn the row into a one-row frame so DataFrame.eval can resolve the
    # s1..s6 columns referenced by the expression; infer_objects converts
    # the object-dtype boolean cells back to bool
    df = row_series.to_frame().transpose().infer_objects()
    return df.eval(df["membership_expression"].values[0]).values[0]

resultDF_pd["is_matching_user"] = resultDF_pd.apply(lambda row: evaluate_expr(row), axis=1)
resultDF_sp = spark.createDataFrame(resultDF_pd[["category_id", "user", "is_matching_user"]])
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
|         c1|  a3|           false|
|         c1|  a4|            true|
|         c1|  a2|           false|
|         c1|  a1|            true|
|         c2|  a3|           false|
|         c2|  a4|           false|
|         c2|  a2|            true|
|         c2|  a1|           false|
|         c3|  a3|            true|
|         c3|  a4|            true|
|         c3|  a2|           false|
|         c3|  a1|           false|
+-----------+----+----------------+

Finally, filter the matching users:

resultDF_sp = resultDF_sp.filter("is_matching_user")
+-----------+----+----------------+
|category_id|user|is_matching_user|
+-----------+----+----------------+
|         c1|  a4|            true|
|         c1|  a1|            true|
|         c2|  a2|            true|
|         c3|  a3|            true|
|         c3|  a4|            true|
+-----------+----+----------------+
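
To produce the full desired output (user, category_id, start_date, duration), this result can be joined back to a category dataframe that still carries those two columns. A minimal sketch, assuming categoryDF_full holds the remaining columns from the question's table:

categoryDF_full = spark.createDataFrame([
    ("c1", "2022-05-01", 30),
    ("c2", "2022-06-20", 50),
    ("c3", "2022-06-10", 60),
], ["category_id", "start_date", "duration"])

finalDF = (resultDF_sp
           .join(categoryDF_full, "category_id")
           .select("user", "category_id", "start_date", "duration"))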
