
Scala + Spark: filter a dataset if it contains elements from a list

I have a dataset that I want to filter based on a column.

val test = Seq(
("1", "r2_test"),
("2", "some_other_value"),
("3", "hs_2_card"),
("4", "vsx_np_v2"),
("5", "r2_test"),
("2", "some_other_value2")
).toDF("id", "my_column")

I want to create a function that filters my DataFrame based on the elements of a list, using contains on "my_column" (if the value contains any list element as a substring, the row must be kept).

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def filteredElements(df: DataFrame): DataFrame = {
   val elements = List("r2", "hs", "np")
   df.filter($"my_column".contains(elements))
}

But this doesn't work for a list, only for a single element. How can I adapt it to use my list without having to chain multiple filters?

Below is the expected output after applying the function:

val output = test.transform(filteredElements)

expected =
("1", "r2_test"), // contains "r2"
("3", "hs_2_card"), // contains "hs"
("4", "vsx_np_v2"), // contains "np"
("5", "r2_test"), // contains "r2"

One way to solve this is to use a UDF. I think there should be some way to solve this with Spark SQL functions that I'm not aware of. Anyway, you can define a UDF that tells whether a String contains any of the values in your elements list:

import org.apache.spark.sql.functions._
val elements = List("r2", "hs", "np")

val isContainedInList = udf { (value: String) => 
  elements.exists(e => value.indexOf(e) != -1)
}

You can use this UDF in select, filter, basically anywhere you want:

def filteredElements(df: DataFrame): DataFrame = {
   df.filter(isContainedInList($"my_column"))
}

And the result is as expected:

+---+---------+
| id|my_column|
+---+---------+
|  1|  r2_test|
|  3|hs_2_card|
|  4|vsx_np_v2|
|  5|  r2_test|
+---+---------+
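The matching logic inside the UDF is ordinary Scala, so it can be checked without a Spark session. The sketch below is only an illustration: the object name `ContainsAnyCheck` and the helper `containsAny` are hypothetical, not part of the answer. It also wraps the value in `Option`, on the assumption that "my_column" may hold nulls, which would likely make the `indexOf` call in the UDF above throw a NullPointerException.

```scala
object ContainsAnyCheck {
  // Same substrings as in the question
  val elements: List[String] = List("r2", "hs", "np")

  // True when `value` is non-null and contains at least one of `parts`.
  // Option(value) turns a null into None, so null input yields false.
  def containsAny(value: String, parts: Seq[String]): Boolean =
    Option(value).exists(v => parts.exists(p => v.contains(p)))

  def main(args: Array[String]): Unit = {
    // The rows from the question, as plain tuples instead of a DataFrame
    val rows = Seq(
      ("1", "r2_test"),
      ("2", "some_other_value"),
      ("3", "hs_2_card"),
      ("4", "vsx_np_v2"),
      ("5", "r2_test"),
      ("2", "some_other_value2")
    )
    // Keep only the rows whose second field matches any element
    val kept = rows.filter { case (_, value) => containsAny(value, elements) }
    kept.foreach(println)
  }
}
```

Running `main` keeps the rows with ids 1, 3, 4 and 5, matching the expected output in the question. The same null-safe body could be used inside the `udf { ... }` definition.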

You can do it in one line without a UDF (better for performance and simpler):

df.filter(col("my_column").isNotNull).filter(row => elements.exists(row.getAs[String]("my_column").contains)).show()
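Both the UDF and the row lambda run a Scala closure for each row, which Spark's optimizer generally treats as opaque. A possible alternative, sketched here under the assumption that plain substring matching is all that is needed, builds a single Column expression by OR-ing one `contains` per list element. The object name `ContainsAnyColumn` and the helper `containsAny` are made up for this example; `Column.contains`, `||` and `lit` are standard Spark SQL API.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ContainsAnyColumn {
  // Hypothetical helper: builds (c contains e1) OR (c contains e2) OR ...
  // Starting the fold from lit(false) means an empty list matches no rows.
  def containsAny(column: Column, elements: Seq[String]): Column =
    elements.foldLeft(lit(false))((acc, e) => acc || column.contains(e))

  def filteredElements(df: DataFrame): DataFrame = {
    val elements = List("r2", "hs", "np")
    df.filter(containsAny(col("my_column"), elements))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in a notebook or spark-shell
    // the existing `spark` value would be used instead.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-any-example")
      .getOrCreate()
    import spark.implicits._

    val test = Seq(
      ("1", "r2_test"),
      ("2", "some_other_value"),
      ("3", "hs_2_card"),
      ("4", "vsx_np_v2"),
      ("5", "r2_test"),
      ("2", "some_other_value2")
    ).toDF("id", "my_column")

    test.transform(filteredElements).show()
    spark.stop()
  }
}
```

Because the condition is a Column expression, rows where "my_column" is null evaluate to null and are dropped by `filter`, so no separate `isNotNull` step should be needed. If the list were long, a single `rlike` with an alternation pattern might be another option, provided the elements are escaped for regex.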
