How to create a dynamic filter condition and use it to filter rows on a Spark dataframe?

There are two variables that I read from a file and use to filter a dataframe. The number of variables will vary on a daily basis, so I need to build the filter condition dynamically so that no code change is required.

clientid1=1,2,3,4
clientref1=10,20,30,40 
clientid2=1,2,3,4
clientref2=10,20,30,40 

I read the above file row by row, split each line by "=", and load it into a Map[String,String]. To get a value, I pass the key "clientid1" and split the value by ",":

val clientid1 :Array[String] = parameterValues("clientid1").split(",")
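
(For reference, the map is built roughly like this; a simplified sketch that assumes the file is named parameters.txt and that every line is a key=value pair:)

import scala.io.Source

// hypothetical loader: one "key=value" line per parameter, e.g. "clientid1=1,2,3,4"
val parameterValues: Map[String, String] = Source.fromFile("parameters.txt")
  .getLines()
  .map(_.trim)
  .filter(_.nonEmpty)
  .map { line =>
    val Array(key, value) = line.split("=", 2)
    key -> value
  }
  .toMap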

I use these variables as below to filter the DF, and this works fine:

val finalDF = customerDF
              .filter((col("clientid").isin(clientid1: _*) && col("clientref").isin(clientref1: _*))
                      ||(col("clientid").isin(clientid2: _*) && col("clientref").isin(clientref2: _*)))

Now I need to build the part below dynamically:

(col("clientid").isin(clientid1: _*) && col("clientref").isin(clientref1: _*))
                          ||(col("clientid").isin(clientid2: _*) && col("clientref").isin(clientref2: _*))

If clientid3 and clientref3 are present in the input file, then the filter condition should be:

(col("clientid").isin(clientid1: _*) && col("clientref").isin(clientref1: _*))
||(col("clientid").isin(clientid2: _*) && col("clientref").isin(clientref2: _*))
||(col("clientid").isin(clientid3: _*) && col("clientref").isin(clientref3: _*))

I used a function like the one below to generate the condition, but it returns a String and the filter condition does not accept it:

def generateFilterCond(count: Int, parameterValues: Map[String,String]): String = {
  var filterCond = ""
  for (i <- 1 to count) {
    val clientid: Array[String] = parameterValues(s"clientid$i").split(",")
    val clientref: Array[String] = parameterValues(s"clientref$i").split(",")

    filterCond = filterCond + (col("clientid").isin(clientid: _*) && col("clientref").isin(clientref: _*)) + " || "
  }
  filterCond.substring(0, filterCond.length - 4).trim
}

Any suggestions on how to build this?

Indeed, filter takes a Spark Column as parameter, but the method generateFilterCond returns a String. Thus the first step is to change the signature to return a Column.

However, we now have an issue: instead of initializing the filterCond variable in the first line of the function with an empty String, we need to initialize it with an empty Column, which does not exist in Spark.

But since we know that we will pass the column to a filter, and that the condition we build is only a sequence of or, we can initialize this column to a condition that will always be false, such as lit(false).

Thus, we can rewrite your method as follows:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

def generateFilterCond(count: Int, parameterValues: Map[String,String]): Column = {
  var filterCond = lit(false)
  for (i <- 1 to count) {
    val clientid: Array[String] = parameterValues(s"clientid$i").split(",")
    val clientref: Array[String] = parameterValues(s"clientref$i").split(",")

    // each clientid/clientref pair contributes one more OR-ed condition
    val addedCondition = col("clientid").isin(clientid: _*) && col("clientref").isin(clientref: _*)

    filterCond = filterCond || addedCondition
  }
  filterCond
}
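
You can then pass the resulting Column straight to filter, for example (a minimal sketch, assuming the two pairs from your file):

val finalDF = customerDF.filter(generateFilterCond(2, parameterValues))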

This method still has some issues. For instance, a NoSuchElementException will be raised if the parameterValues map doesn't contain a specific clientid/clientref between 1 and count. Moreover, with its mutable accumulator and loop, it is not written in a classic functional style.

So, assuming that we ignore all the problematic cases (no clientid/clientref pair at all, or one of clientid/clientref missing while the other exists), a final rewrite of this function could be:

def generateFilterCond(count: Int, parameterValues: Map[String, String]): Column = (1 to count)
  .map(i => (parameterValues.get(s"clientid$i"), parameterValues.get(s"clientref$i")))
  .flatMap(elem => for {
    clientIds <- elem._1
    clientRefs <- elem._2
  } yield col("clientid").isin(clientIds.split(","): _*) && col("clientref").isin(clientRefs.split(","): _*))
  .foldLeft(lit(false))((acc, c) => acc || c)
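
Since the number of pairs varies from day to day, you can also derive count from the map itself instead of hard-coding it (a sketch, assuming the keys always follow the clientidN naming):

// count how many clientidN keys the parameter map actually contains
val count = parameterValues.keys.count(_.matches("clientid\\d+"))
val finalDF = customerDF.filter(generateFilterCond(count, parameterValues))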

Also be aware that you can run into issues if the column expression resulting from your dynamic filter condition is too big, so you should keep your variables file small.

To complete my previous answer: if your variables file contains a lot of lines, then instead of trying to build a dynamic filter condition, you can read the variables file into a dataframe with Spark and filter your input dataframe by joining it with the dataframe built from the variables file.

Filter the dataframe without a dynamic filter condition

In the code below, your variables file is variables.txt, the dataframe on which you apply the filter is input, and the Spark session is spark.

import spark.implicits._

def extractInfo(line: String): (Int, String, Seq[Int]) = {
  val mainId = line.dropWhile(_.isLetter).takeWhile(_.isDigit).toInt               // e.g. 1 for "clientid1=..."
  val kind = line.takeWhile(_.isLetter)                                            // "clientid" or "clientref"
  val values = line.dropWhile(s => s.isLetterOrDigit).tail.split(",").map(_.toInt) // values after "="
  (mainId, kind, values)
}

def explodeIdRef(infos: Seq[(Int, String, Seq[Int])]): Seq[(Int, Int)] = for {
  clientId <- infos.find(_._2 == "clientid").map(_._3.toSeq).getOrElse(Nil)
  clientRef <- infos.find(_._2 == "clientref").map(_._3.toSeq).getOrElse(Nil)
} yield (clientId, clientRef)

val config = spark.read.text("variables.txt")
  .as[String]
  .map(extractInfo)
  .groupByKey(_._1)
  .flatMapGroups((_, iterator) => explodeIdRef(iterator.toSeq))
  .toDF("clientid", "clientref")

val filtered = input.join(config, Seq("clientid", "clientref"))
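
If the dataframe built from the variables file stays small, you can additionally hint Spark to broadcast it so the join avoids a shuffle (a sketch using Spark's broadcast function):

import org.apache.spark.sql.functions.broadcast

// broadcast the small config dataframe to every executor before joining
val filtered = input.join(broadcast(config), Seq("clientid", "clientref"))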

extractInfo parses a line from variables.txt and extracts the useful parts. For instance, if the line is "clientid1=2,3", extractInfo returns (1, "clientid", Seq(2, 3)).

explodeIdRef creates all the clientid/clientref pairs from a sequence of extracted info. For instance, if you have Seq((1, "clientid", Seq(2, 3)), (1, "clientref", Seq(20, 30))), explodeIdRef returns Seq((2, 20), (2, 30), (3, 20), (3, 30)).

Test

If your variables.txt contains the following lines:

clientid1=1,2,3,4
clientref1=10,20,30,40
clientid2=5,6,7,8
clientref2=50,60,70,80

And you have the following input dataframe:

+---+--------+---------+
|id |clientid|clientref|
+---+--------+---------+
|1  |1       |10       |
|2  |2       |30       |
|3  |8       |10       |
|4  |8       |50       |
|5  |9       |90       |
+---+--------+---------+

The filtered dataframe will be:

+---+--------+---------+
|id |clientid|clientref|
+---+--------+---------+
|1  |1       |10       |
|2  |2       |30       |
|4  |8       |50       |
+---+--------+---------+
