
PySpark Databricks: read header from first file

I am trying to read the airline dataset in Databricks.

Path in Databricks: /databricks-datasets/airlines

There are multiple files under this folder, named part-00000 and so on.

Only the part-00000 file has header information; the other files do not.

I am trying to read all the files using the following command:

df= spark.read.format("csv").option("header", "true").load("/databricks-datasets/airlines/part-*")

For some reason it is not pulling the header information from the first file, part-00000. Is there a way to take the header from part-00000, since the other files don't have header info?

Thanks!

You can first read the CSV partition file that contains the header:

df = spark \
    .read \
    .format("csv")\
    .option("header", "true") \
    .load("/databricks-datasets/airlines/part-00000")

Then save the schema:

csv_schema = df.schema

And you can now read all the partitions using csv_schema:

df = spark \
    .read \
    .format("csv")\
    .schema(csv_schema) \
    .load("/databricks-datasets/airlines")
