I am trying to read csv file and convert into dataframe. input.txt
4324,'Andy',43.5,20.3,53.21
2342,'Sam',22.1
3248,'Jane',11.05,12.87
6457,'Bob',32.1,75.23,71.6
Schema: Id, Name,Jan,Feb,March
As you see the csv file doesn't have "," if there are no trailing expenses.
Code:
from pyspark.sql.types import *
input1= sc.textFile('/FileStore/tables/input.txt').map(lambda x: x.split(","))
schema = StructType([StructField('Id',StringType(),True), StructField('Name',StringType(),True), StructField('Jan',StringType(),True), StructField('Feb',StringType(),True), StructField('Mar',StringType(),True)])
df3 = sqlContext.createDataFrame(input1, schema)
I get ValueError: Length of object (4) does not match with length of fields (5). How do I resolve this?
I would first import the file using pandas which should handle everything for you. From there you can then convert the pandas DataFrame to spark and do all your usual stuff. I copied your example txt file and quickly wrote up some code to confirm that it would all work:
import pandas as pd
# Reading in txt file as csv
df_pandas = pd.read_csv('<your location>/test.txt',
sep=",")
# Converting to spark dataframe and displaying
df_spark = spark.createDataFrame(df_pandas)
display(df_pandas)
Which produced the following output:
The faster method would be to import through spark:
# Importing csv file using pyspark
csv_import = sqlContext.read\
.format('csv')\
.options(sep = ',', header='true', inferSchema='true')\
.load('<your location>/test.txt')
display(csv_import)
Which gives the same output.
from pyspark.sql.types import *
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
fields = [StructField('Id', StringType(), True), StructField('Name', StringType(), True),
StructField('Jan', StringType(), True), StructField('Feb', StringType(), True),
StructField('Mar', StringType(), True)]
schema = StructType(fields)
data = spark.read.format("csv").load("test2.txt")
df3 = spark.createDataFrame(data.rdd, schema)
df3.show()
Output:
+----+------+-----+-----+-----+
| Id| Name| Jan| Feb| Mar|
+----+------+-----+-----+-----+
|4324|'Andy'| 43.5| 20.3|53.21|
|2342| 'Sam'| 22.1| null| null|
|3248|'Jane'|11.05|12.87| null|
|6457| 'Bob'| 32.1|75.23| 71.6|
+----+------+-----+-----+-----+
Here are a couple options for you to consider. These use the wildcard character, so you can loop through all folders and sub-folders, look for files with names that match a specific pattern, and merge everything into a dingle dataframe.
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
myDFCsv.show()
myDFCsv.head()
myDFCsv.count()
//////////////////////////////////////////
// If you also need to load the filename
import org.apache.spark.sql.functions.input_file_name
val myDFCsv = spark.read.format("csv")
.option("sep",",")
.option("inferSchema","true")
.option("header","true")
.load("mnt/rawdata/2019/01/01/client/ABC*.gz")
.withColumn("file_name",input_file_name())
myDFCsv.show(false)
myDFCsv.head()
myDFCsv.count()
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.