
How to read a CSV file with runs of spaces between columns into a DataFrame in Scala?

I tried to load a CSV file in which the columns are separated by runs of spaces. (Screenshot: the CSV file opened in Notepad.)

First line of the CSV:

058921107                          039128053                          20200701-290640-0             20200701 000000BORGWARNER ITHACA LLC DBA BORGWARNE                         489140-10001                       LDD INVENTORY                                               039128053           1     4359697                                           PACKAGE,CHAIN DRIVE                                                                                 005                 285000492           0                     19691231 185959                              0                     20200101 00000020200630 000000IMMEDIATE                1600                  20200630 000000   

Sample script used:

import org.apache.spark.sql.{DataFrame, SparkSession}

// `spark` is the SparkSession (predefined in spark-shell)
val df1: DataFrame = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", " ")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv("test.csv")

df1.show(2)

I have set the column count to 18; I am not sure whether that is correct for your data.

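from pyspark.sql.functions import col, regexp_replace, split

df = spark.read.text('test.csv')

col_size = 18  # assumed number of columns

# Strip trailing spaces, turn each run of 2+ spaces into '|', then split on '|'
df.withColumn('value', split(regexp_replace(regexp_replace('value', r'([ ]*)$', ''), r'([ ]{2,})', '|'), r'\|')) \
  .select(*[col('value')[i] for i in range(0, col_size)]) \
  .toDF(*[f'col{i + 1}' for i in range(0, col_size)]).show(30, False)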

+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
|col1     |col2     |col3             |col4                                              |col5        |col6         |col7     |col8|col9   |col10              |col11|col12    |col13|col14          |col15|col16                                  |col17|col18          |
+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
|058921107|039128053|20200701-290640-0|20200701 000000BORGWARNER ITHACA LLC DBA BORGWARNE|489140-10001|LDD INVENTORY|039128053|1   |4359697|PACKAGE,CHAIN DRIVE|005  |285000492|0    |19691231 185959|0    |20200101 00000020200630 000000IMMEDIATE|1600 |20200630 000000|
+---------+---------+-----------------+--------------------------------------------------+------------+-------------+---------+----+-------+-------------------+-----+---------+-----+---------------+-----+---------------------------------------+-----+---------------+
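
To check the column count instead of guessing it, the same splitting rule (drop trailing spaces, then split on runs of two or more spaces) can be tried on a single line in plain Python without Spark. This is only a sketch: split_fields is a hypothetical helper name, and the sample string is a shortened, made-up line modelled on the one in the question, not the real record.

```python
import re

def split_fields(line):
    """Split one line on runs of two or more spaces, ignoring trailing spaces."""
    return re.split(r" {2,}", line.rstrip(" "))

# Hypothetical shortened line modelled on the sample in the question
sample = "058921107      039128053     20200701-290640-0   LDD INVENTORY    1     "

fields = split_fields(sample)
print(len(fields))  # number of columns this rule produces for the line
for i, value in enumerate(fields, start=1):
    print(f"col{i}: {value}")
```

Running this over a few real lines of test.csv and taking the largest field count would give a value for col_size. Note that fields stored back to back with no space between them, as col4 and col16 appear to be in the output above, stay fused under this rule; separating them would need known column positions.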


 