
Unable to read hadoop/hive external s3 table from spark

All of a sudden I am unable to read the Hive external S3 table from Spark. I noticed that subfolders have been created under a few of the partitions.

I hope there is a parameter or setting that can be configured so Hadoop doesn't create these subfolders.

When I manually delete the subfolders from S3, I can read the table, but I need to find a way so these subfolders won't get created randomly in the future.
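To find which partitions have picked up nested subfolders before deleting anything, a minimal sketch along these lines may help. The bucket name and key prefix below are placeholders (not from the question), and the script only lists the offending keys so they can be reviewed first:

    import boto3

    # Placeholders: set these to the bucket and key prefix from the table LOCATION.
    BUCKET = "my-bucket"
    TABLE_PREFIX = "my_path/"

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # A normal key looks like "my_path/partition_source_id=11/000000_0".
    # A key such as "my_path/partition_source_id=11/1/000000_0" has an extra
    # path segment, i.e. the file sits in a nested subfolder that the
    # non-recursive reader trips over.
    for page in paginator.paginate(Bucket=BUCKET, Prefix=TABLE_PREFIX):
        for obj in page.get("Contents", []):
            relative = obj["Key"][len(TABLE_PREFIX):]
            if relative.count("/") > 1:  # partition dir / subfolder / file
                print(obj["Key"])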

CREATE EXTERNAL TABLE `mydb.mytable`(
    `id` string COMMENT 'from deserializer', 
    `attribute_value` string COMMENT 'from deserializer', 
    `attribute_date` string COMMENT 'from deserializer', 
    `source_id` string COMMENT 'from deserializer')
     PARTITIONED BY (`partition_source_id` int)
     ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde' 
     STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
     OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
     LOCATION 's3://path/my_data'
     TBLPROPERTIES ('transient_lastDdlTime'='1567170767')

When I run a select * query I get:

error: java.io.IOException: Not a file: s3://my_path/partition_source_id=11/1 1 statement failed.
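If those numeric subfolders actually contain the data files (for example, Hive on Tez is known to write the output of UNION queries into numbered subdirectories), a common read-side workaround, not a guaranteed fix, is to let the input format recurse into them. A minimal PySpark sketch, assuming Hive support and the table from the question; whether these properties take effect depends on the Spark/Hive versions in use:

    from pyspark.sql import SparkSession

    # Pass the recursive-read properties to Hadoop via spark.hadoop.* so the
    # underlying TextInputFormat descends into nested partition subfolders
    # instead of failing with "Not a file".
    spark = (SparkSession.builder
             .appName("read-mydb-mytable")
             .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
             .config("spark.hadoop.mapred.input.dir.recursive", "true")
             .enableHiveSupport()
             .getOrCreate())

    # Hive-side equivalent for tables read through the Hive reader.
    spark.sql("SET hive.mapred.supports.subdirectories=true")

    spark.sql("SELECT * FROM mydb.mytable LIMIT 10").show()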

I don't think this DDL creates the subfolders. If there is some job that loads data into 's3://path/my_data' and executes an ADD PARTITION DDL on mydb.mytable, I think you should take a look at that job.
