I have the below dataset as input:
816|27555832600|01|14|25| |
825|54100277425|14|03|25|15|
9003|54100630574| | | | |
809|51445926423|12|08|25|17|
and I am getting the below as output:
null|null|null|null|null|null|
825|54100277425| 14| 3| 25| 15|
null|null|null|null|null|null|
809|51445926423| 12| 8| 25| 17|
816|27555832600|01|14|25|null|
825|54100277425|14|03|25|15|
9003|54100630574|null|null|null|null|
809|51445926423|12|08|25|17|
I have tried the below code to load the .txt or .bz2 file:
val dataset = sparkSession.read.format(formatType)
  .option("delimiter", "|")
  .schema(schema_new)
  .csv(dataFilePath)
I tried your problem statement. I am using Spark 3.0.1 to solve this use case, and it is working as expected. Try the code snippet below.
val sampleDS = spark.read.options(Map("delimiter" -> "|")).csv("D:\\DataAnalysis\\DataSample.csv")
sampleDS.show()
Output ->
+----+-----------+---+---+---+---+---+
| _c0| _c1|_c2|_c3|_c4|_c5|_c6|
+----+-----------+---+---+---+---+---+
| 816|27555832600| 01| 14| 25| | |
| 825|54100277425| 14| 03| 25| 15| |
|9003|54100630574| | | | | |
| 809|51445926423| 12| 08| 25| 17| |
+----+-----------+---+---+---+---+---+
Now consider the case where your input data contains a blank line. Input data after adding a blank line:
816|27555832600|01|14|25| |
825|54100277425|14|03|25|15|
9003|54100630574| | | | |
||||
809|51445926423|12|08|25|17|
After reading the data, you can simply use sampleDS.na.drop.show()
to remove blank or null rows.
Please note that if a line is completely empty (no delimiters at all), Spark does not include it in the DataFrame; it discards such blank lines while reading.
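Putting the steps above together, here is a minimal sketch (assuming the sample file path from the snippet above). One detail worth noting: na.drop() with no arguments drops rows that contain *any* null, while na.drop("all") drops only rows where *every* column is null, which is usually what you want for blank "||||" lines so that partially filled rows like 9003 survive.

```scala
import org.apache.spark.sql.SparkSession

object BlankLineDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BlankLineDemo")
      .master("local[*]")
      .getOrCreate()

    // "delimiter" (or the shorter alias "sep") sets the field separator.
    val sampleDS = spark.read
      .option("delimiter", "|")
      .csv("D:\\DataAnalysis\\DataSample.csv") // path from the example above

    // drop() removes rows containing ANY null column;
    // drop("all") removes only rows where EVERY column is null.
    sampleDS.na.drop("all").show()

    spark.stop()
  }
}
```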