
PySpark join based on multiple parameterized conditions

So I have two PySpark dataframes. Let's call them A and B. I want to perform a left join based on multiple conditions. Let's say the column names on which to join are the following:

cond = [A.columnA1 == B.columnB1, A.columnA2 == B.columnB2]
df = A.join(B, cond, 'left')

Now, what if I don't know the column names in advance and want to parameterize this? Imagine the user is allowed to pass two lists of column names on which to join (each list may contain more than two columns; we don't know how many in advance).

Imagine we have the following lists of columns on which we want to join, which take their input from the user:

columnlistA=[]
columnlistB=[]

The user can pass any number of column names in these lists, but both lists will always have the same length, such that the first element of columnlistA corresponds to the first element of columnlistB in the join, and so on for the remaining elements. How do I write the join so that it uses these column list parameters in the join condition for these dataframes?

You can do that by using aliases for your dataframes. That way, you can access them by referring to their column names as simple strings.

If I alias a dataframe as myDataFrame, I can refer to its columns in a string like this:

import pyspark.sql.functions as F
df = spark.createDataFrame(.....)
aliased_df = df.alias("myDataFrame")
F.col("myDataFrame.columnName")  # this is the same as df.columnName

So you can use that to construct a list of join conditions with your columns dynamically specified:

A.alias("dfA").join(
  B.alias("dfB"),
  [F.col("dfA."+col_a) == F.col("dfB."+col_b) for col_a, col_b in zip(columnlistA, columnlistB)],
  'left'
)
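
For illustration, here is a minimal end-to-end sketch of that approach (the sample data, column names, and list contents are made up for this example). Note that when you pass a list of conditions to join, Spark combines them with AND:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data, just to exercise the parameterized join
A = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "code"])
B = spark.createDataFrame([(1, "x", 10.0), (3, "z", 20.0)], ["id2", "code2", "value"])

columnlistA = ["id", "code"]
columnlistB = ["id2", "code2"]

cond = [F.col("dfA." + col_a) == F.col("dfB." + col_b)
        for col_a, col_b in zip(columnlistA, columnlistB)]

df = A.alias("dfA").join(B.alias("dfB"), cond, 'left')
df.show()
# rows from A with no match in B get NULL in B's columns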

The following solution is based on two lists from which it will generate the join conditions. It assumes that the equality operator between columns is always ==. You can control the binary operator between the conditions by specifying the op argument (only [or, and] are allowed).

from pyspark.sql.functions import col
from functools import reduce
from pyspark.sql import Column
from pyspark.sql.column import _bin_op  # note: _bin_op is a private PySpark helper

def generate_conditions(left_cols: list, right_cols: list, op: str = "or") -> Column:
  if not left_cols or not right_cols:
    raise ValueError("The lists should not be empty.")

  if len(left_cols) != len(right_cols):
    raise ValueError("The lists should have the same length.")

  if op not in ["and", "or"]:
    raise ValueError("Only [and, or] binary operators are allowed.")

  # build one equality condition per column pair, then fold them together with `op`
  pairwise_conditions = [col(l) == col(r) for l, r in zip(left_cols, right_cols)]
  return reduce(lambda x, y: _bin_op(op)(x, y), pairwise_conditions)

l = ["a1", "a2", "a3"]
r = ["b1", "b2", "b3"]

join_conditions = generate_conditions(l, r, "or")

print(join_conditions)
# Column<'(((a1 = b1) OR (a2 = b2)) OR (a3 = b3))'>

Now you can use it in your join as A.join(B, join_conditions, 'left')
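
As a side note, _bin_op is a private PySpark helper and may change between versions. A sketch of the same idea built only on the public & and | operators of Column (via the standard-library operator module) could look like this; generate_conditions_public is a hypothetical name, not part of the answer above:

import operator
from functools import reduce
from pyspark.sql.functions import col
from pyspark.sql import Column

def generate_conditions_public(left_cols: list, right_cols: list, op: str = "or") -> Column:
  # operator.and_/operator.or_ dispatch to Column.__and__/__or__,
  # i.e. the public & and | operators on Column
  ops = {"and": operator.and_, "or": operator.or_}
  pairs = [col(l) == col(r) for l, r in zip(left_cols, right_cols)]
  return reduce(ops[op], pairs)

print(generate_conditions_public(["a1", "a2"], ["b1", "b2"], "and"))
# Column<'((a1 = b1) AND (a2 = b2))'>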
