
Removing spaces in a DataFrame using Scala (I have loaded a CSV file into an RDD and am trying to remove spaces from it)

I have data with unwanted spaces and null values in my CSV file. I have loaded this file into a Spark RDD without any problem. Now I have to remove the spaces and null values from this RDD. Can anyone help me with how to do that?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object Oracle {
  def main(args: Array[String]): Unit = {
    System.setProperty("hadoop.home.dir", "D:\\hadoop\\")
    val spark = SparkSession.builder().appName("Schema").master("local[*]").getOrCreate()

    val inpp = spark.read.csv("file:///C:/Users/user/Desktop/xyz.csv")
    inpp.show()

    // Rename the single CSV column (default name _c0) to "name"
    val df = inpp.toDF("name")

    // Select from df rather than inpp: only df has a column called "name"
    df.select(
      col("name"),
      regexp_replace(col("name"), "\\s+$", ""),
      rtrim(col("name")),
      length(col("name"))
    ).show()
  }
}

Here is my data with unwanted spaces and null values.

You can do it like this:

scala> val someDFWithName = Seq((1, "anu rag"), (2,"raj u"),(3, "  ram   "), (4, null), (5, "")).toDF("id", "name")

Now filter out the empty and null values (a comparison against null is never true, so the null row is dropped as well) and apply the regex to remove the spaces:

scala> someDFWithName.filter(col("name") =!= "").select(
 |     col("name"),
 |     regexp_replace(col("name"), " ", ""),
 |     length(col("name"))
 |     ).show()

Output will be:

+--------+-------------------------+------------+
|    name|regexp_replace(name,  , )|length(name)|
+--------+-------------------------+------------+
| anu rag|                   anurag|           7|
|   raj u|                     raju|           5|
|  ram   |                      ram|           8|
+--------+-------------------------+------------+
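Note that the regex above removes every space, including the ones inside a value ("anu rag" becomes "anurag"). If only the leading and trailing spaces should go, and inner runs of whitespace should shrink to a single space, something along these lines may work. This is a minimal sketch, not part of the original answer: the object name TrimSketch, the helper cleanNames and the output column "cleaned" are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, length, regexp_replace, trim}

object TrimSketch {
  // Collapse runs of whitespace to one space, trim both ends,
  // then drop rows whose cleaned value is null or empty.
  // Assumes the input has a string column called "name".
  def cleanNames(df: DataFrame): DataFrame =
    df.withColumn("cleaned", trim(regexp_replace(col("name"), "\\s+", " ")))
      .filter(col("cleaned").isNotNull && length(col("cleaned")) > 0)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TrimSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val someDFWithName =
      Seq((1, "anu rag"), (2, "raj u"), (3, "  ram   "), (4, null), (5, "")).toDF("id", "name")

    // Rows 4 (null) and 5 (empty) are dropped; "  ram   " becomes "ram",
    // while "anu rag" and "raj u" keep their inner space.
    cleanNames(someDFWithName).show()
    spark.stop()
  }
}
```

Keeping the original "name" column next to "cleaned" makes it easy to compare values before and after; drop it with .drop("name") once the result looks right.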

Thanks.

You can pass these options to the CSV reader to trim the data while it is read, and then filter out the irrelevant rows:

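import org.apache.spark.sql.functions.col

// Trim leading/trailing whitespace at read time, then drop rows with a null name
val df = spark.read
  .format("csv")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("file:///C:/Users/user/Desktop/xyz.csv")
  .filter(col("name").isNotNull)

// show() returns Unit, so call it separately to keep df usable as a DataFrame
df.show()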
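If the data is already loaded, or has several string columns, the same clean-up can be applied after reading. The following is a rough sketch under assumptions: CleanStrings and trimAll are hypothetical names, not Spark APIs, and column names are assumed not to contain dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, length, lit, trim, when}
import org.apache.spark.sql.types.StringType

object CleanStrings {
  // Trim every string column and turn values that are empty after
  // trimming into null; non-string columns pass through unchanged.
  def trimAll(df: DataFrame): DataFrame = {
    val cleaned = df.schema.fields.map { f =>
      if (f.dataType == StringType) {
        val t = trim(col(f.name))
        when(length(t) === 0, lit(null).cast(StringType)).otherwise(t).as(f.name)
      } else {
        col(f.name)
      }
    }
    df.select(cleaned: _*)
  }
}
```

With empty strings normalised to null, one call such as CleanStrings.trimAll(df).na.drop(Seq("name")) removes both the null and the blank rows for the "name" column.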
