Filter condition in Spark DataFrame based on a list of values

I am trying to filter a DataFrame based on a list of values, and I am able to run it the way it is given in example 1. However, when I convert the elements into a list and then pass the list into the 'isin' function inside the filter function, it does not work (shown in example 2).

val df1 = sc.parallelize(Seq((1,"abcd"), (2,"defg"), (3, "ghij"),(4,"xyzz"),(5,"lmnop"),(6,"pqrst"),(7,"wxyz"),(8,"lmnoa"),(9,"jklm"))).toDF("c1","c2")
//example 1:
val df2 = df1.filter(substring(col("c2"), 0, 3).isin("abc","def","ghi"))

//example 2:
val given_list = List("abc","def","ghi")
val df3 = df1.filter(substring(col("c2"), 0, 3).isin(given_list))

The error message while running example 2 is shown below:

Exception in thread "main" java.lang.RuntimeException: Unsupported literal type class scala.collection.immutable.$colon$colon List(abc, def, ghi)                                                  

19/10/22 17:03:10 INFO spark.SparkContext: Invoking stop() from shutdown hook                                                                                                                                   
19/10/22 17:03:10 INFO server.AbstractConnector: Stopped Spark@5817c15f{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}                                                                                                      
19/10/22 17:03:10 INFO ui.SparkUI: Stopped Spark web UI at http://192.---.---.---:----                                                                                                                          
19/10/22 17:03:10 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!                                                                                                            
19/10/22 17:03:10 INFO memory.MemoryStore: MemoryStore cleared                                                                                                                                                  
19/10/22 17:03:10 INFO storage.BlockManager: BlockManager stopped                                                                                                                                               
19/10/22 17:03:10 INFO storage.BlockManagerMaster: BlockManagerMaster stopped                                                                                                                                   
19/10/22 17:03:10 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!                                                                                      
19/10/22 17:03:10 INFO spark.SparkContext: Successfully stopped SparkContext                                                                                                                                    
19/10/22 17:03:10 INFO util.ShutdownHookManager: Shutdown hook called                                                                                                                                           
19/10/22 17:03:10 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d17c100d-3e95-4016-a2fd-4e1e02b2449f          

Thanks in advance.

The method isin takes an Any* varargs parameter rather than a collection like List. You can use the "splat" operator (i.e. _*) as shown below:

df1.filter(substring(col("c2"), 0, 3).isin(given_list: _*))

Spark 2.4+ does provide the method isInCollection, which takes an Iterable collection and can be used as follows:

df1.filter(substring(col("c2"), 0, 3).isInCollection(given_list))
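The difference between isin(given_list) and isin(given_list: _*) comes down to Scala's varargs rules, and it can be seen without Spark at all. A minimal sketch in plain Scala follows; the method matchAny is hypothetical, defined here only to mimic the Any* signature of Column.isin:

```scala
// A varargs method with the same parameter shape as Column.isin(values: Any*)
def matchAny(values: Any*): Set[Any] = values.toSet

val given_list = List("abc", "def", "ghi")

// Passing the List directly makes the WHOLE list a single varargs element —
// this mirrors why Spark tries (and fails) to turn the List into one literal.
val asOneElement = matchAny(given_list)
assert(asOneElement.size == 1)

// The splat operator (: _*) expands the list into individual arguments.
val asThreeElements = matchAny(given_list: _*)
assert(asThreeElements.size == 3)
assert(asThreeElements.contains("abc"))
```

This is why example 2 raises "Unsupported literal type ... List(...)": Spark receives one argument of type List and has no literal encoding for it, whereas the splatted form hands it three plain strings.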
