So, the basics are:
Below are my imports:
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
fileDirectory = 'data/'
Below is the actual code:
for fname in os.listdir(fileDirectory):
    sqlContext.read.format("csv").\
        option("header", "true").\
        option("inferSchema", "true").\
        load(fname)
    df_app = app_dat_df
    df_app.createOrReplaceTempView(fname)
But I'm getting the following error message:
AnalysisException: u'Unable to infer schema for CSV. It must be specified manually.;'
It would appear that it has no problem with the way I'm passing in the files (great), but it won't let me infer the schemas. When I load each file manually, this has never been an issue.
Can someone give me some pointers on how I can improve this and get it to run?
Many, many thanks!
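Since inferSchema is throwing an error, you should manually specify the schema of your CSV data.
Also, as @Marie mentioned, you need to slightly modify your load call: assign the result to a variable and pass the full path (directory plus file name) rather than the bare file name.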
import os
from pyspark.sql.types import *

# Replace these example fields with the actual columns of your CSV files.
customSchema = StructType([
    StructField("string_col", StringType(), True),
    StructField("integer_col", IntegerType(), True),
    StructField("double_col", DoubleType(), True)])

fileDirectory = 'data/'
for fname in os.listdir(fileDirectory):
    # os.listdir returns bare file names, so prepend the directory.
    df_app = sqlContext.read.format("csv").\
        option("header", "true").\
        schema(customSchema).\
        load(fileDirectory + fname)
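
If you also want one temp view per file, as in the question, here is a rough sketch of how the loop could be wrapped up. The helper names (csv_paths, view_name_for, register_csv_views) are made up for illustration and are not part of Spark. It assumes sql_context is your existing SQLContext and schema is a StructType like customSchema above. A file name such as sales.csv contains a dot, which Spark may not accept as a view name, so the sketch strips the extension and replaces non-word characters with underscores.

```python
import os
import re


def csv_paths(directory):
    """Return the full paths of the .csv files directly inside `directory`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)  # bare names, no directory prefix
        if name.lower().endswith(".csv")
    )


def view_name_for(path):
    """Hypothetical helper: 'data/my-file.csv' becomes 'my_file'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    return re.sub(r"\W", "_", stem)


def register_csv_views(sql_context, directory, schema):
    """Load each CSV with an explicit schema and register it as a temp view.

    Assumes `sql_context` is a pyspark SQLContext (or SparkSession) and
    `schema` is a StructType that matches every file in `directory`.
    """
    views = {}
    for path in csv_paths(directory):
        df = (
            sql_context.read.format("csv")
            .option("header", "true")
            .schema(schema)
            .load(path)
        )
        name = view_name_for(path)
        df.createOrReplaceTempView(name)
        views[name] = df  # keep every DataFrame instead of overwriting one variable
    return views
```

You would then call something like register_csv_views(sqlContext, 'data/', customSchema) and query the views by the derived names. If your files do not all share one schema, you would need a schema per file instead.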
Hope this helps!
Don't forget to let us know if it solved your problem :)