
Writing a file out to Delta Lake produces different results from the DataFrame when read back using Apache Spark on Databricks

I have the following code in my Databricks notebook:

# Read the full flights sample dataset, inferring the schema from the CSV headers
fulldf = spark.read.format("csv").option("header", True).option("inferSchema", True).load("/databricks-datasets/flights/")

# Write the full DataFrame out as a Delta table
fulldf.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Full/')

# Take the first 10 rows and write them to a second Delta table
df = fulldf.limit(10)
df.write.format("delta").mode("overwrite").save('/mnt/lake/BASE/flights/Small/')

When I do a display on df, I get the results I expect to see:

display(df)

[Screenshot: display(df) output showing ten rows of flight data]

As you can see, there are ten rows with the correct information.

However, when I read the actual Parquet file saved under '/mnt/lake/BASE/flights/Small/' using the following:

test = spark.read.parquet('/mnt/lake/BASE/flights/Small/part-00000-d9d24a80-28d6-43f5-950f-3c53a7d1336a-c000.snappy.parquet')

display(test)

I get a completely different result, although it should be exactly the same:

[Screenshot: display(test) output showing different, unexpected rows]

This is very strange.

I believe the problem has to do with limiting the results to 10 rows, but I don't see why that should give a completely different result.

I am surprised you even got output; on Databricks I got nothing but an error with your read approach.

That path is a Delta table (a directory managed by Delta Lake), so you must read it with the delta format. Delta Lake does store its data as Parquet files underneath, but you need to go through the Delta API: the table's transaction log determines which Parquet files belong to the current version, and an overwrite does not immediately delete the old files; it only marks them as removed in the log. Reading a single part file directly bypasses the log and can therefore hand you stale data from a previous version of the table. A quick way to see this is to compare what is physically on disk with what the current version actually references, as sketched below.
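A minimal diagnostic sketch, assuming the '/mnt/lake/BASE/flights/Small/' path from the question and a Databricks notebook (dbutils is Databricks-only, and DataFrame.inputFiles() is a best-effort listing):

path = "/mnt/lake/BASE/flights/Small/"

# Parquet files the *current* Delta version actually references (best effort)
live = {p.split("/")[-1] for p in spark.read.format("delta").load(path).inputFiles()}

# Everything physically in the directory, including files left behind by
# earlier overwrites; only the "live" ones belong to the table now
for f in dbutils.fs.ls(path):
    if f.name.endswith(".parquet"):
        print("live " if f.name in live else "stale", f.path)

Any file reported as stale is exactly the kind of leftover your direct Parquet read can pick up.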

To read the table correctly, go through the Delta API for both the write and the read:

df.write.format("delta").mode("overwrite").save("/AAAGed") 

and

df = spark.read.format("delta").load("/AAAGed")

If the table is partitioned, apply a filter on the partition columns when you read, as in the sketch below.
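A minimal sketch of a partitioned write and a filtered read, reusing the /AAAGed path from above; "Origin" and 'SFO' are hypothetical stand-ins for a real partition column and value in your data:

# "Origin" is a hypothetical partition column; substitute a real column
df.write.format("delta").mode("overwrite").partitionBy("Origin").save("/AAAGed")

# A filter on the partition column lets Delta prune partitions instead of
# scanning the whole table
result = spark.read.format("delta").load("/AAAGed").filter("Origin = 'SFO'")
display(result)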
