Iterate over files in a directory in PySpark to automate DataFrame and SQL table creation

So, the basics are:

  • I'm on Spark 2.x
  • I'm running this all in a Jupyter notebook
  • My goal is to iterate over a number of files in a directory and have Spark (1) create DataFrames and (2) register those DataFrames as Spark SQL tables. Basically, I want to be able to open the notebook at any time and have a clean way of loading everything that is available.

Below are my imports:

import os
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

fileDirectory = 'data/'

Below is the actual code:

for fname in os.listdir(fileDirectory):
    sqlContext.read.format("csv").\
            option("header", "true").\
            option("inferSchema", "true").\
            load(fname)

    df_app = app_dat_df
    df_app.createOrReplaceTempView(fname)

But I'm getting the following error message:

AnalysisException: u'Unable to infer schema for CSV. It must be specified manually.;'

It appears that iterating over the files is not the problem (great), but Spark won't infer the schemas. When I load each file manually, this has never been an issue.

Can someone give me some pointers on how to improve this and get it to run?

Many, many thanks!

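Since inferSchema is throwing an error, you should specify the schema of your CSV data manually.

Also, as @Marie mentioned, you need to modify your load call slightly: os.listdir returns bare file names, so pass the full path (fileDirectory + fname) to load, and assign the result to a variable so that the DataFrame can be used afterwards.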

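import os
from pyspark.sql.types import (StructType, StructField,
                               StringType, IntegerType, DoubleType)

# Placeholder schema: replace the column names and types with those of your CSV files.
customSchema = StructType([
    StructField("string_col", StringType(), True),
    StructField("integer_col", IntegerType(), True),
    StructField("double_col", DoubleType(), True)])

fileDirectory = 'data/'
for fname in os.listdir(fileDirectory):
    # os.listdir returns bare file names, so prepend the directory before loading.
    df_app = sqlContext.read.format("csv").\
        option("header", "true").\
        schema(customSchema).\
        load(fileDirectory + fname)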
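
To also cover the second half of the question (turning each DataFrame into a Spark SQL table), here is a minimal sketch of one way the loop could be wrapped up. The helper names csv_paths, view_name_for, and register_csv_views are hypothetical and not part of PySpark, and the rule for turning a file name into a view name is just one possible convention. The sketch assumes every file that should be loaded ends in .csv, and it skips sub-directories and dot-files (a Jupyter working directory often contains .ipynb_checkpoints).

```python
import os
import re


def csv_paths(directory):
    """Return the full paths of the regular .csv files in `directory`, sorted."""
    paths = []
    for fname in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, fname)
        # Skip hidden entries, sub-directories, and anything that is not a CSV file.
        if fname.startswith("."):
            continue
        if os.path.isfile(full_path) and fname.lower().endswith(".csv"):
            paths.append(full_path)
    return paths


def view_name_for(path):
    """Derive a view name from a file path (hypothetical naming rule)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # Replace every character that is not an ASCII letter, digit, or underscore.
    return re.sub(r"[^0-9A-Za-z_]", "_", stem)


def register_csv_views(sql_context, directory, schema=None):
    """Load each CSV in `directory` and register it as a temporary view.

    `sql_context` is any object exposing `.read`, such as the question's
    `sqlContext` or a SparkSession. Returns a dict of view name -> DataFrame.
    """
    frames = {}
    for path in csv_paths(directory):
        reader = sql_context.read.format("csv").option("header", "true")
        if schema is not None:
            # Use the explicit schema, as suggested in the answer.
            reader = reader.schema(schema)
        else:
            # Otherwise fall back to schema inference, as in the question.
            reader = reader.option("inferSchema", "true")
        df = reader.load(path)
        name = view_name_for(path)
        df.createOrReplaceTempView(name)
        frames[name] = df
    return frames
```

With the names from the answer, this would be called as register_csv_views(sqlContext, fileDirectory, customSchema). A file such as data/sales 2020.csv would then be queryable as the view sales_2020. Note that passing a single schema only makes sense if all files share the same columns; otherwise leave schema as None or keep one schema per file.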

Hope this helps!


Don't forget to let us know if it solved your problem :)
