I am trying to read the airlines dataset in Databricks.
Path in Databricks: /databricks-datasets/airlines
There are multiple files in this folder, starting from part-00000 and so on.
Only the part-00000 file contains a header row; the other files do not.
I am trying to read all the files using the following command:
df= spark.read.format("csv").option("header", "true").load("/databricks-datasets/airlines/part-*")
For some reason, this does not pick up the header from the first file, part-00000. Is there a way to take the header from part-00000, given that the other files have none?
Thanks!
You can first read the CSV part file that contains the header:
df = spark \
    .read \
    .format("csv") \
    .option("header", "true") \
    .load("/databricks-datasets/airlines/part-00000")
Then save the schema:
csv_schema = df.schema
You can now read all the part files using the schema csv_schema:
df = spark \
    .read \
    .format("csv") \
    .schema(csv_schema) \
    .load("/databricks-datasets/airlines")
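One thing to watch with this approach: the second read does not set the header option, so the header line of part-00000 comes back as an ordinary data row. Also, a schema taken from a header-only read (no inferSchema) types every column as a string. The following is a minimal sketch of the two-step read with the header row filtered out afterwards. The function name read_parts_with_header and its parameters are hypothetical, and spark is assumed to be the SparkSession that a Databricks notebook provides. Identifying the header row by comparing only the first column to its own name is a heuristic assumption, not something Spark does for you.

```python
def read_parts_with_header(spark, header_file, all_files):
    """Read headerless CSV part files using the header of a single file.

    Hypothetical helper: `spark` is an existing SparkSession,
    `header_file` is the one file that has a header row, and
    `all_files` is a path or glob that covers every part file.
    """
    # Step 1: read only the file with the header to obtain column names.
    # Without inferSchema, every column in this schema is a string.
    header_df = (
        spark.read.format("csv")
        .option("header", "true")
        .load(header_file)
    )
    csv_schema = header_df.schema

    # Step 2: read every part file with that schema and no header option,
    # so no line is skipped in the files that lack a header.
    df = spark.read.format("csv").schema(csv_schema).load(all_files)

    # Step 3: the header line of `header_file` is now a data row whose
    # values equal the column names. Assumption: checking the first column
    # is enough to spot it. eqNullSafe keeps rows where the value is null.
    first_col = df.columns[0]
    return df.filter(~df[first_col].eqNullSafe(first_col))
```

In a notebook this could be called as read_parts_with_header(spark, "/databricks-datasets/airlines/part-00000", "/databricks-datasets/airlines/part-*"). If numeric types are needed, the columns can be cast after the read.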