
PySpark Databricks: read header from first file

I am trying to read the airline dataset in Databricks.

Path in Databricks: /databricks-datasets/airlines

There are multiple files under this folder, named part-00000 and so on.

Only the part-00000 file has header information; the other files do not.

I am trying to read all the files using the following command:

df= spark.read.format("csv").option("header", "true").load("/databricks-datasets/airlines/part-*")

For some reason it is not pulling the header information from the first file, part-00000. Is there a way to take the header from part-00000, since the other files don't have header info?

Thanks!

You can first read the CSV partition file that contains the header:

df = spark \
    .read \
    .format("csv")\
    .option("header", "true") \
    .load("/databricks-datasets/airlines/part-00000")

Then save the schema:

csv_schema = df.schema

And you can now read all the partitions using csv_schema:

df = spark \
    .read \
    .format("csv")\
    .schema(csv_schema) \
    .load("/databricks-datasets/airlines")
