Cache error when querying table stored in Glue Data Catalog through Zeppelin

Question

I am running into an error caused by the way Zeppelin caches tables. We update the data in the Glue Data Catalog in real time, so when we use Spark to query a partition that has just been updated, we sometimes get the following error:

org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket/prefix/partition.snappy.parquet, range: 0-16165503, partition values: [empty row], isDataPresent: false, eTag: 53ea26b5ecc9a194efe5163f3c297800-1

Running REFRESH TABLE <table_name> or restarting the Spark interpreter from the Zeppelin UI makes the error go away, but I am not sure that clearing the cache is what fixes it; simply retrying the query may be what actually helps.

One solution may be to run a scheduled query that refreshes all tables at a given time, but this would be highly inefficient.

Thanks!

Answer 1

Please run spark.sql("REFRESH TABLE {db}.{table}") before querying the table, replacing {db} and {table} with your database and table names.
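Since the question mentions that refreshing every table on a schedule would be wasteful, one option is to refresh only the affected table, and only when a query actually fails. Below is a minimal sketch of that idea, not a tested fix for this exact setup. The helper name query_with_refresh and its parameters are hypothetical; it assumes `spark` is the SparkSession that Zeppelin provides in a %pyspark paragraph. It catches Exception broadly because the exact Python exception type raised for the FileDownloadException is not shown in the question.

```python
def query_with_refresh(spark, db, table, query, max_attempts=2):
    """Run `query`; if it fails, refresh the table's cached metadata and retry.

    Hypothetical helper: `spark` is assumed to be a SparkSession
    (for example, the one Zeppelin exposes as `spark`).
    """
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0:
            # Invalidate cached file listings and metadata for this table only.
            spark.sql(f"REFRESH TABLE {db}.{table}")
        try:
            # collect() forces execution, so a stale-file error surfaces here
            # rather than later when the lazy DataFrame is first used.
            return spark.sql(query).collect()
        except Exception as err:  # broad on purpose; narrow it if you can
            last_error = err
    raise last_error
```

In a Zeppelin paragraph this could be called as rows = query_with_refresh(spark, "my_db", "my_table", "SELECT * FROM my_db.my_table WHERE dt = '2022-12-06'"), where the database, table, and filter are placeholders. Note that collect() brings all rows to the driver, so for large results you may prefer to trigger a different action inside the try block.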

See also: When to execute REFRESH TABLE my_table in spark?
