How to delete/filter specific rows from a Spark DataFrame

I want to delete specific records from a Spark DataFrame.

Sample input:
[image: sample input]

Expected output:
[image: expected output]

Discarded rows:
[image: discarded rows]

I have written the code below to filter the DataFrame, but it is incorrect:


val Name = List("Rahul", "Mahesh", "Gaurav")
val Age = List(20, 55)

val final_pub_df = df.filter(!col("Name").isin(Name: _*) && !col("Age").isin(Age: _*))

So my question is: how do I filter the DataFrame on more than one column with specific filter criteria? The DataFrame should be filtered on the basis of the combination of the Name and Age fields.

Here's the solution. Based on your dataset, I formulated the problem as follows -

The DataFrame below has incorrect entries. I want to remove all incorrect records and keep only the correct ones -

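// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.col
import spark.implicits._ // needed for .toDF on a Seq

val Friends = Seq(
      ("Rahul", "99", "AA"),
      ("Rahul", "20", "BB"),
      ("Rahul", "30", "BB"),
      ("Mahesh", "55", "CC"),
      ("Mahesh", "88", "DD"),
      ("Mahesh", "44", "FF"),
      ("Ramu", "30", "FF"),
      ("Gaurav", "99", "PP"),
      ("Gaurav", "20", "HH")).toDF("Name", "Age", "City")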

Lists used for filtering -

val Name = List("Rahul", "Mahesh", "Gaurav")
val IncorrectAge = List(20, 55)

Data operation (negate the whole combined condition, not each part) -

Friends.filter(!(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))).show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+
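
To see why the negation has to wrap the whole condition, the two predicates can be compared on plain Scala collections, without Spark. This is only a sketch under stated assumptions: Friend and FilterLogicDemo are hypothetical names introduced for illustration, and Age is modeled as an Int for simplicity, whereas the DataFrame above stores it as a string.

```scala
// Plain-Scala model of the two predicates (no Spark required).
// Friend and FilterLogicDemo are hypothetical names used only in this sketch.
final case class Friend(name: String, age: Int, city: String)

object FilterLogicDemo {
  val friends: List[Friend] = List(
    Friend("Rahul", 99, "AA"), Friend("Rahul", 20, "BB"), Friend("Rahul", 30, "BB"),
    Friend("Mahesh", 55, "CC"), Friend("Mahesh", 88, "DD"), Friend("Mahesh", 44, "FF"),
    Friend("Ramu", 30, "FF"), Friend("Gaurav", 99, "PP"), Friend("Gaurav", 20, "HH"))

  val names: Set[String] = Set("Rahul", "Mahesh", "Gaurav")
  val incorrectAges: Set[Int] = Set(20, 55)

  // Predicate from the question: !A && !B.
  // Keeps a row only if the name is NOT listed AND the age is NOT listed.
  val keptByQuestion: List[Friend] =
    friends.filter(f => !names(f.name) && !incorrectAges(f.age))

  // Predicate from the answer: !(A && B), which equals !A || !B (De Morgan).
  // Drops a row only if the name is listed AND the age is listed.
  val keptByAnswer: List[Friend] =
    friends.filter(f => !(names(f.name) && incorrectAges(f.age)))

  def main(args: Array[String]): Unit = {
    println(keptByQuestion.map(_.name)) // only the Ramu row survives
    println(keptByAnswer.size)          // 6 rows survive, matching the table above
  }
}
```

With the question's predicate, every row whose name is in the list is dropped regardless of age, which is why only the Ramu row remains.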

You can also do it with the help of joins.

Create a badrecords DataFrame -

val badrecords = Friends.filter(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))

Use a left_anti join to select Friends minus badrecords -

Friends.alias("left").join(badrecords.alias("right"), Seq("Name", "Age"), "left_anti").show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+
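
One caveat with two separate lists: isin on Name and isin on Age discards every cross-combination of the two lists (a row with Name "Rahul" and Age 55 would also be dropped). If only specific (Name, Age) pairs should go, the anti join can be run against a DataFrame of exact pairs instead. The following is a minimal sketch, assuming that is the intent; badPairs, AntiJoinOnPairs and keepRows are hypothetical names, and the local SparkSession is created only to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// AntiJoinOnPairs and keepRows are hypothetical names used only in this sketch.
object AntiJoinOnPairs {
  def keepRows(spark: SparkSession): DataFrame = {
    import spark.implicits._ // enables .toDF on a Seq of tuples

    val friends = Seq(
      ("Rahul", "99", "AA"), ("Rahul", "20", "BB"), ("Rahul", "30", "BB"),
      ("Mahesh", "55", "CC"), ("Mahesh", "88", "DD"), ("Mahesh", "44", "FF"),
      ("Ramu", "30", "FF"), ("Gaurav", "99", "PP"), ("Gaurav", "20", "HH")
    ).toDF("Name", "Age", "City")

    // Assumption: these exact (Name, Age) combinations are the ones to discard.
    val badPairs = Seq(
      ("Rahul", "20"), ("Mahesh", "55"), ("Gaurav", "20")
    ).toDF("Name", "Age")

    // left_anti keeps the rows of friends that have no match in badPairs
    // on both join columns.
    friends.join(badPairs, Seq("Name", "Age"), "left_anti")
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("anti-join-pairs")
      .getOrCreate()
    keepRows(spark).show()
    spark.stop()
  }
}
```

On this sample data the exact-pair version and the two-list version discard the same three rows; they differ only when the data contains a combination that is in both lists but is not meant to be removed.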

I think you may want to flip the NOT condition. filter on a DataFrame works like the WHERE clause in SQL (where is an alias for filter): it keeps the rows that match the condition.

So you want the query to be:

df.filter(col("Name").isin(Name:_*) && col("Age").isin(Age:_*))
