Reading csv files in PySpark

I am trying to read a CSV file and convert it into a dataframe. input.txt:

4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6

Schema: Id, Name, Jan, Feb, March

As you can see, the file omits the trailing commas when the later expense columns are missing, so the rows have varying numbers of fields.

Code:

from pyspark.sql.types import *

input1 = sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))

schema = StructType([StructField('Id', StringType(), True),
                     StructField('Name', StringType(), True),
                     StructField('Jan', StringType(), True),
                     StructField('Feb', StringType(), True),
                     StructField('Mar', StringType(), True)])

df3 = sqlContext.createDataFrame(input1, schema)

I get ValueError: Length of object (4) does not match with length of fields (5), since the shorter rows don't fill all five schema fields. How do I resolve this?
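One direct fix is to pad each split row out to the expected five fields before applying the schema, so the row length always matches. A minimal sketch, assuming the same file path and the sc/sqlContext handles from the question:

from pyspark.sql.types import StructType, StructField, StringType

# Pad each short row with None so every row has exactly five fields
cols = ['Id', 'Name', 'Jan', 'Feb', 'Mar']
schema = StructType([StructField(c, StringType(), True) for c in cols])

rows = (sc.textFile('/FileStore/tables/input.txt')
          .map(lambda line: line.split(","))
          .map(lambda fields: fields + [None] * (len(cols) - len(fields))))

df = sqlContext.createDataFrame(rows, schema)
df.show()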

I would first import the file using pandas, which should handle everything for you. From there you can convert the pandas DataFrame to Spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:

import pandas as pd

# Read the txt file as CSV; the file has no header row,
# so supply the column names explicitly
df_pandas = pd.read_csv('<your location>/test.txt',
                        sep=",",
                        header=None,
                        names=['Id', 'Name', 'Jan', 'Feb', 'Mar'])

# Convert to a Spark dataframe and display it
df_spark = spark.createDataFrame(df_pandas)
display(df_spark)

Which produced the expected five-column dataframe as output.

The faster method is to import the file directly through Spark:

# Importing the csv file using Spark; the file has no header row,
# so read without one and rename the columns afterwards
csv_import = sqlContext.read\
                       .format('csv')\
                       .options(sep=',', header='false', inferSchema='true')\
                       .load('<your location>/test.txt')\
                       .toDF('Id', 'Name', 'Jan', 'Feb', 'Mar')

display(csv_import)

Which gives the same output.
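On Spark 2.x and later, the same read can also be written with the spark.read.csv shorthand; a sketch under the same assumption about the file location, renaming the default _c0-style columns afterwards:

# Shorthand csv read; rename the default columns afterwards
csv_import = spark.read.csv('<your location>/test.txt',
                            sep=',',
                            header=False,
                            inferSchema=True) \
                  .toDF('Id', 'Name', 'Jan', 'Feb', 'Mar')

display(csv_import)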

from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

# All columns as nullable strings; missing trailing fields become null
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
          StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
          StructField('Mar', StringType(), True)]
schema = StructType(fields)

# Spark's csv reader pads the shorter rows with nulls,
# so every row comes back with five fields
data = spark.read.format("csv").load("test2.txt")

df3 = spark.createDataFrame(data.rdd, schema)
df3.show()

Output:

+----+------+-----+-----+-----+
|  Id|  Name|  Jan|  Feb|  Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
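Alternatively, the schema can be handed straight to the csv reader, which skips the RDD round-trip and produces the same dataframe; a minimal sketch reusing the schema object defined above:

# Pass the schema directly to the reader; no RDD conversion needed
df3 = spark.read.format("csv").schema(schema).load("test2.txt")
df3.show()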

Here are a couple of options for you to consider. These use a wildcard in the path, so you can loop through all folders and sub-folders, pick up files whose names match a specific pattern, and merge everything into a single dataframe.

// Wildcard load: picks up every gzipped csv whose name starts with ABC
val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")

myDFCsv.show()
myDFCsv.head()
myDFCsv.count()


// If you also need to load the filename of each input file
import org.apache.spark.sql.functions.input_file_name

val myDFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
  .withColumn("file_name", input_file_name())

myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
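If you are working in PySpark rather than Scala, the same wildcard pattern carries over directly; a sketch of the equivalent read:

from pyspark.sql.functions import input_file_name

# PySpark equivalent of the wildcard load, with the source filename attached
myDFCsv = (spark.read.format("csv")
                .option("sep", ",")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("mnt/rawdata/2019/01/01/client/ABC*.gz")
                .withColumn("file_name", input_file_name()))

myDFCsv.show(truncate=False)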
