I have data with unwanted spaces and null values in my CSV file. I have loaded this file into a Spark RDD, and up to here there is no problem. Now I have to remove the spaces and null values from this data. How can I do that? Can anyone help me, please?
object Oracle {
  def main(args: Array[String]): Unit = {
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    System.setProperty("hadoop.home.dir", "D:\\hadoop\\")

    val spark = SparkSession.builder().appName("Schema").master("local[*]").getOrCreate()
    import spark.implicits._

    val inpp = spark.read.csv("file:///C:/Users/user/Desktop/xyz.csv")
    inpp.show()

    // Rename the single column so it can be referenced as "name"
    val df = inpp.toDF("name")
    df.select(
      col("name"),
      regexp_replace(col("name"), "\\s+$", ""),
      rtrim(col("name")),
      length(col("name"))
    ).show()
  }
}
You can do it like this:
scala> val someDFWithName = Seq((1, "anu rag"), (2,"raj u"),(3, " ram "), (4, null), (5, "")).toDF("id", "name")
Now filter out the empty and null values, and apply the regex to remove the extra spaces:
scala> someDFWithName.filter(col("name") =!= "").select(
     |   col("name"),
     |   regexp_replace(col("name"), " ", ""),
     |   length(col("name"))
     | ).show()
Output will be:
+--------+------------------------+------------+
|    name|regexp_replace(name, , )|length(name)|
+--------+------------------------+------------+
| anu rag|                  anurag|           7|
|   raj u|                    raju|           5|
|   ram  |                     ram|           8|
+--------+------------------------+------------+
Thanks.
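If you want to keep single spaces between words rather than delete every space, a variant of the above (a sketch against the same hypothetical someDFWithName DataFrame) is to filter nulls and blanks explicitly, trim the ends, and collapse interior runs of whitespace with regexp_replace:

```scala
import org.apache.spark.sql.functions._

// Drop nulls and blank strings explicitly, then trim leading/trailing
// whitespace and collapse internal runs of spaces to a single space.
val cleaned = someDFWithName
  .filter(col("name").isNotNull && trim(col("name")) =!= "")
  .withColumn("name", regexp_replace(trim(col("name")), "\\s+", " "))

cleaned.show()
// " ram " becomes "ram"; "anu rag" keeps its single interior space
```

Note that `=!=` (rather than the deprecated `!==`) is the Column inequality operator; comparing a null against "" yields null, which the filter also drops.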
You can provide these options to the CSV reader to trim the data, and then filter out the irrelevant rows:
val df = spark.read
  .format("csv")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("file:///C:/Users/user/Desktop/xyz.csv")
  .filter(col("name").isNotNull)

df.show()
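If nulls can appear in any column, not just name, a sketch of an alternative (assuming the loaded DataFrame is bound to df before show() is called) is the DataFrame's na functions, which handle nulls without listing columns one by one:

```scala
// Drop any row that has a null in at least one column...
val noNulls = df.na.drop()

// ...or only rows where every column is null...
val noAllNull = df.na.drop("all")

// ...or replace nulls in string columns with a placeholder instead of dropping.
val filled = df.na.fill("")
```

Choosing between drop and fill depends on whether downstream code can tolerate placeholder values or needs the rows gone entirely.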