
Values are coming back null for all columns in a Spark Scala DataFrame

I have the below dataset as input:

816|27555832600|01|14|25|  |  
825|54100277425|14|03|25|15|  
9003|54100630574|  |  |  |  |  
809|51445926423|12|08|25|17|  

I am getting the below output:

null|null|null|null|null|null|
825|54100277425|  14|   3|  25|  15|
null|null|null|null|null|null|
809|51445926423|  12|   8|  25|  17|

Expected output:

816|27555832600|01|14|25|null|  
825|54100277425|14|03|25|15|  
9003|54100630574|null|null|null|null|  
809|51445926423|12|08|25|17|  
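The expected output above maps whitespace-only fields to null while keeping the rest of the row. Spark itself is not needed to see that field-level behaviour; a minimal sketch in plain Scala (the `PipeParser`/`parse` names are hypothetical, not from the question):

```scala
object PipeParser {
  // Split a pipe-delimited record, keeping trailing empty fields (limit = -1),
  // and map whitespace-only fields to None (Spark's null) instead of dropping the row.
  def parse(line: String): Array[Option[String]] =
    line.split("\\|", -1).map { field =>
      val trimmed = field.trim
      if (trimmed.isEmpty) None else Some(trimmed)
    }
}
```

For example, `PipeParser.parse("816|27555832600|01|14|25|  |  ")` yields `Some("816")`, …, `None, None` for the last two fields, which is the per-field null behaviour the question expects.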

I have tried the below code to load the .txt or .bz2 file:

val dataset = sparkSession.read.format(formatType)
        .option("delimiter", "|")   // extra closing parenthesis removed
        .schema(schema_new)
        .csv(dataFilePath)
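A likely cause of the all-null rows is the combination of a typed schema (e.g. `IntegerType` columns) with whitespace-only fields: a field like `"  "` fails the numeric conversion, the record is treated as malformed, and in Spark's default `PERMISSIVE` mode the whole row can come back null. One workaround is to read every column as a string and cast after trimming. The conversion step can be sketched in plain Scala (the `SafeCast`/`toIntOpt` names are hypothetical; `toIntOption` requires Scala 2.13+):

```scala
object SafeCast {
  // Trim, then attempt an Int parse: "  14" becomes Some(14),
  // while a whitespace-only field becomes None (null) instead of
  // invalidating the entire record.
  def toIntOpt(field: String): Option[Int] = field.trim.toIntOption
}
```

In the DataFrame itself, the analogous fix is reading with an all-`StringType` schema and then applying `trim`/`cast` per column, so a blank field nulls only that column rather than the whole row.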

I tried your problem statement using Spark 3.0.1, and it works as expected. Try the below code snippet.

val sampleDS = spark.read.options(Map("DELIMITER"->"|")).csv("D:\\DataAnalysis\\DataSample.csv")
sampleDS.show()

Output ->
+----+-----------+---+---+---+---+---+
| _c0|        _c1|_c2|_c3|_c4|_c5|_c6|
+----+-----------+---+---+---+---+---+
| 816|27555832600| 01| 14| 25|   |   |
| 825|54100277425| 14| 03| 25| 15|   |
|9003|54100630574|   |   |   |   |   |
| 809|51445926423| 12| 08| 25| 17|   |
+----+-----------+---+---+---+---+---+

Consider the case where you have a blank line in the input data.

Input data after adding a blank line:

816|27555832600|01|14|25|  |  
825|54100277425|14|03|25|15|  
9003|54100630574|  |  |  |  |  
||||
809|51445926423|12|08|25|17| 

After reading the data, you can simply use sampleDS.na.drop.show() to remove blank or null rows.
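What `na.drop` does for the `||||` row can be sketched without Spark: drop any record whose fields are all blank. A minimal plain-Scala analogue (the `BlankFilter`/`dropBlank` names are hypothetical):

```scala
object BlankFilter {
  // Drop records where every pipe-delimited field is blank after trimming --
  // a plain-Scala analogue of DataFrame.na.drop on an all-null row.
  def dropBlank(lines: Seq[String]): Seq[String] =
    lines.filterNot(_.split("\\|", -1).forall(_.trim.isEmpty))
}
```

A line such as `"||||"` splits into five empty fields, so it is filtered out, while any line with at least one non-blank field is kept. Note that by default `na.drop` drops a row containing *any* null; pass `"all"` (i.e. `na.drop("all")`) to drop only rows where every column is null.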

Please note that if a line is completely empty, Spark does not include it in the DataFrame; it skips blank lines while reading.

