
Spark SQL queries on a partitioned table fail after partition folders are removed

Below is what I am trying, in order:

  1. Create a partitioned table in Hive, partitioned by the current hour.
  2. Using the Spark HiveContext, run MSCK REPAIR TABLE.
  3. Manually delete the HDFS folder of one of the added partitions.
  4. Using the Spark HiveContext again:
     a. MSCK REPAIR TABLE does not remove the already-added partition whose HDFS folder is gone. This seems to be known behavior of "msck repair".
     b. SELECT * FROM tablexxx WHERE (existing partition) fails with a FileNotFoundException pointing to the HDFS folder that was deleted manually.
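The sequence above can be sketched in Hive/Spark SQL roughly as follows (the table name, partition column, and paths are hypothetical, for illustration only):

```sql
-- Hypothetical setup: external table partitioned by hour.
CREATE EXTERNAL TABLE tablexxx (id INT, payload STRING)
PARTITIONED BY (hr STRING)
LOCATION 'hdfs:///data/tablexxx';

-- Step 2: discover partition folders already present on HDFS.
MSCK REPAIR TABLE tablexxx;

-- Step 3 happens outside SQL, e.g.:
--   hdfs dfs -rm -r /data/tablexxx/hr=10

-- Step 4a: re-running repair adds new partitions but does not
-- drop the stale one whose folder was just deleted.
MSCK REPAIR TABLE tablexxx;

-- Step 4b: querying the table now fails with a
-- FileNotFoundException for the deleted folder.
SELECT * FROM tablexxx WHERE hr = '11';
```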

Any insights on this behavior would be of great help.

Yes, MSCK REPAIR TABLE will only discover new partitions, not delete "old" ones.

Working with external Hive tables where you have deleted the HDFS folder, I see two solutions:

  1. Drop the table (the files will not be deleted because the table is external), re-create the table using the same location, and then run MSCK REPAIR TABLE. This is my preferred solution.
  2. Drop each partition you deleted using ALTER TABLE <table> DROP PARTITION <partition>.
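Both options can be written out along these lines (the table name, partition column, and location are hypothetical):

```sql
-- Option 1: drop and re-create the external table, then rediscover
-- only the partitions that still exist on HDFS.
DROP TABLE tablexxx;  -- external table: the remaining data files are kept
CREATE EXTERNAL TABLE tablexxx (id INT, payload STRING)
PARTITIONED BY (hr STRING)
LOCATION 'hdfs:///data/tablexxx';
MSCK REPAIR TABLE tablexxx;

-- Option 2: remove just the stale partition from the metastore.
ALTER TABLE tablexxx DROP PARTITION (hr = '10');
```

Option 2 is cheaper if only a few partitions were removed; option 1 is simpler when many partition folders are gone and you would rather rebuild the metadata from what is actually on HDFS.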

What you observe in your case may be related to these issues: https://issues.apache.org/jira/browse/SPARK-15044 and https://issues.apache.org/jira/browse/SPARK-19187
