
Filter Spark dataset using custom function in Scala

I am relatively new to Spark and I am trying to filter out invalid records from a Spark Dataset. My dataset looks something like this:

| Id | Curr | Col3 |
|----|------|------|
| 1  | USD  | 1111 |
| 2  | CNY  | 2222 |
| 3  | USD  | 3333 |
| 1  | CNY  | 4444 |

In my logic, each Id has a valid currency, so it will basically be a map of Id -> currency:

val map = Map(1 -> "USD", 2 -> "CNY")

I want to filter out the rows from the dataset whose Id does not correspond to the valid currency code. After my filter operation, the dataset should look something like this:

| Id | Curr | Col3 |
|----|------|------|
| 1  | USD  | 1111 |
| 2  | CNY  | 2222 |

The limitation I have here is that I cannot use a UDF. Can somebody help me come up with a filter operation for this?

You can create a DataFrame out of the map and then do an inner join with the original DataFrame to filter it:
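The answer assumes a SparkSession and the question's sample data already exist; here is a minimal setup sketch (the names spark and df are assumptions, not part of the original answer):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("filter-example").getOrCreate()
import spark.implicits._  // required for .toDF on Scala collections

// the question's sample data
val df = Seq(
  (1, "USD", 1111),
  (2, "CNY", 2222),
  (3, "USD", 3333),
  (1, "CNY", 4444)
).toDF("Id", "Curr", "Col3")

// the valid Id -> currency map from the question
val map = Map(1 -> "USD", 2 -> "CNY")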

val map_df = map.toSeq.toDF("Id", "Curr")
// map_df: org.apache.spark.sql.DataFrame = [Id: int, Curr: string]

// the inner join keeps only rows whose (Id, Curr) pair appears in map_df
df.join(map_df, Seq("Id", "Curr")).show
+---+----+----+
| Id|Curr|Col3|
+---+----+----+
|  1| USD|1111|
|  2| CNY|2222|
+---+----+----+
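Since map_df holds only a handful of rows, the join can also be hinted as a broadcast join, so the small table is shipped to every executor instead of being shuffled; a sketch, assuming the same df and map_df as above:

import org.apache.spark.sql.functions.broadcast

// same inner join, but hints Spark to broadcast the small map_df
df.join(broadcast(map_df), Seq("Id", "Curr")).show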
Alternatively, the same check can be expressed on a plain Scala collection (this example uses "CAN" rather than "CNY" in its sample data):

val a = List((1,"USD",1111),(2,"CAN",2222),(3,"USD",4444),(1,"CAN",5555))
val b = Map(1 -> "USD",2 -> "CAN")
// keep tuples whose Id is a key of b and whose currency matches b(Id)
// equivalent one-liner: a.filter(x => b.get(x._1).contains(x._2))
a.filter(x => b.keys.exists(_ == x._1)).filter(y => y._2 == b(y._1))
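As another UDF-free option (a sketch, not taken from either answer), the valid map can be folded into a single Column predicate so the whole filter stays in Spark SQL expressions; df and map are as defined above:

import org.apache.spark.sql.functions.col

// builds (Id = 1 AND Curr = 'USD') OR (Id = 2 AND Curr = 'CNY')
val valid = map
  .map { case (id, curr) => col("Id") === id && col("Curr") === curr }
  .reduce(_ || _)

df.filter(valid).show
// keeps only (1, USD, 1111) and (2, CNY, 2222), matching the join-based result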
