简体   繁体   中英

How to delete/filter the specific rows from a spark dataframe

I want to delete specific records from a Spark dataframe:
Sample Input:
样本输入

Expected output:
预期产出

Discarded Rows:
丢弃的行

I have written the below code to filter the dataframe(Which is incorrect):


val Name = List("Rahul","Mahesh","Gaurav")
val Age =List(20,55)

val final_pub_df = df.filter(!col("Name").isin(Name:_*) &&  !col("Age").isin(Age:_*))

So my question is - How to filter the dataframe for more than one column with specific filter criteria. The dataframe should be filtered on the basis of the combination of Name and Age fields.

Here's the solution. Based on your dataset I formulated problem -

below dataframe has incorrect entries. I want to remove all incorrect records and keep only correct records -

val Friends = Seq(
      ("Rahul", "99", "AA"),
      ("Rahul", "20", "BB"),
      ("Rahul", "30", "BB"),
      ("Mahesh", "55", "CC"),
      ("Mahesh", "88", "DD"),
      ("Mahesh", "44", "FF"),
      ("Ramu", "30", "FF"),
      ("Gaurav", "99", "PP"),
      ("Gaurav", "20", "HH")).toDF("Name", "Age", "City")

Arrays for filtering -

val Name = List("Rahul", "Mahesh", "Gaurav")
val IncorrectAge = List(20, 55)

Dataops -

Friends.filter(!(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))).show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+

You can also do it with help of joins..

Create a Badrecords df -

val badrecords = Friends.filter(col("Name").isin(Name: _*) && col("Age").isin(IncorrectAge: _*))

User left_anti join to select Friends minus badrecords -

 Friends.alias("left").join(badrecords.alias("right"), Seq("Name", "Age"), "left_anti").show

Here's the output -

+------+---+----+
|  Name|Age|City|
+------+---+----+
| Rahul| 99|  AA|
| Rahul| 30|  BB|
|Mahesh| 88|  DD|
|Mahesh| 44|  FF|
|  Ramu| 30|  FF|
|Gaurav| 99|  PP|
+------+---+----+

I think you may want to flip the not condition.... filter in dataframe is an alias to where clause in sql.

So you want the query to be

df.filter(col("Name").isin(Name:_*) && col("Age").isin(Age:_*))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM