
Cache error when querying table stored in Glue Data Catalog through Zeppelin

I have an error with the way Zeppelin caches tables. We update the data in the Glue Data Catalog in real time, so when we query a partition that was updated using Spark, we sometimes get the following error:

org.apache.spark.sql.execution.datasources.FileDownloadException: Failed to download file path: s3://bucket/prefix/partition.snappy.parquet, range: 0-16165503, partition values: [empty row], isDataPresent: false, eTag: 53ea26b5ecc9a194efe5163f3c297800-1

This can be solved by running refresh table <table_name> or by restarting the Spark interpreter from the Zeppelin UI, although it may simply be the retry, rather than clearing the cache, that resolves the issue.

One solution may be to run a scheduled query that refreshes all tables at a given time, but this would be highly inefficient.

Thanks!

Please run spark.sql("refresh TABLE {db}.{table}") before querying the table.
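As a rough sketch of that suggestion, the refresh can be limited to the one table about to be queried instead of refreshing every table on a schedule. The helper name refresh_and_query and the max_attempts parameter are hypothetical, not part of Spark or Zeppelin. The sketch assumes spark is the SparkSession that Zeppelin provides, and it catches a broad Exception only because the exact Python-side exception type raised for FileDownloadException is not confirmed here.

```python
def refresh_and_query(spark, db, table, query, max_attempts=2):
    """Refresh cached metadata for db.table, run the query, and retry on failure.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession.
    """
    last_error = None
    for _ in range(max_attempts):
        # REFRESH TABLE invalidates Spark's cached metadata and file listing
        # for this one table only.
        spark.sql(f"REFRESH TABLE {db}.{table}")
        try:
            # spark.sql() is lazy, so the stale-file error only surfaces when
            # an action such as collect() actually reads the files.
            return spark.sql(query).collect()
        except Exception as err:  # assumption: narrow this to the real error type
            last_error = err
    raise last_error
```

In a Zeppelin paragraph this could be called as refresh_and_query(spark, "my_db", "my_table", "SELECT * FROM my_db.my_table"), where the database and table names are placeholders. Because a refresh happens before every attempt, this does not tell you whether the refresh or the retry fixed the error, but it avoids refreshing tables that are not being queried.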

When to execute REFRESH TABLE my_table in spark?
