Spark SQL cannot recursively read the HDFS subdirectories of a Hive table (Spark 2.4.6)

Question

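We are trying to read a Hive table with Spark SQL, but it returns no records (the output shows 0 rows). On investigation, we found that the table's HDFS files are stored in multiple subdirectories, as shown below: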

hive> [hadoop@ip-10-37-195-106 CDPJobs]$ hdfs dfs -ls /its/cdp/refn/cot_tbl_cnt_hive/     
Found 18 items     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/1     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/10     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/11     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/12     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/13     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/14     
drwxrwxr-x+ - hadoop hadoop 0 2021-12-19 20:17 /its/cdp/refn/cot_tbl_cnt_hive/15

We tried setting the following properties in the spark-defaults.conf file, but the problem persists.

set spark.hadoop.hive.supports.subdirectories = true;    
set spark.hadoop.hive.mapred.supports.subdirectories = true;     
set spark.hadoop.hive.input.dir.recursive=true;     
set mapreduce.input.fileinputformat.input.dir.recursive=true;          
set recursiveFileLookup=true;            
set spark.hive.mapred.supports.subdirectories=true;         
set spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive=true;

Does anyone know of a solution for this? We are using Spark 2.4.6.

Answer 1

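from pyspark.sql import SparkSession

# Pass each setting as its own key/value pair. Do not combine key/value with
# conf=...: when conf is given, PySpark ignores the key/value arguments.
sparkSession = (SparkSession
                    .builder
                    .appName('USS - Unified Scheme of Sells')
                    .config("hive.metastore.uris", "thrift://probighhwm001:9083")
                    # Ask the Hive/MapReduce input formats to descend into subdirectories.
                    .config("hive.input.dir.recursive", "true")
                    .config("hive.mapred.supports.subdirectories", "true")
                    .config("hive.supports.subdirectories", "true")
                    .config("mapred.input.dir.recursive", "true")
                    .enableHiveSupport()
                    .getOrCreate()
                    )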
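
One thing worth checking on the question's side: the properties there are written as `set key = value;` statements, which is Hive/SQL syntax. As far as I know, spark-defaults.conf expects one property per line as a key followed by whitespace and the value, and spark-submit skips properties from that file whose names do not start with `spark.`. The sketch below is only an illustration of that point; the helper `to_spark_defaults` and the regular expression are hypothetical names introduced here, not part of Spark.

```python
import re

# Hypothetical helper: matches Hive/SQL-style statements such as "set some.key = value;".
SET_STATEMENT = re.compile(r"^\s*set\s+([^\s=]+)\s*=\s*(.*?)\s*;?\s*$", re.IGNORECASE)


def to_spark_defaults(lines):
    """Convert 'set key = value;' statements into 'key value' lines.

    Returns (converted, skipped): converted holds lines in the whitespace-separated
    form used by spark-defaults.conf; skipped holds keys that lack the 'spark.'
    prefix (assumption: such keys are not picked up from that file).
    """
    converted, skipped = [], []
    for line in lines:
        match = SET_STATEMENT.match(line)
        if not match:
            continue  # not a set statement; nothing to convert
        key, value = match.group(1), match.group(2)
        if key.startswith("spark."):
            converted.append(f"{key} {value}")
        else:
            skipped.append(key)
    return converted, skipped


# A few of the statements from the question, used as sample input.
statements = [
    "set spark.hadoop.hive.supports.subdirectories = true;",
    "set mapreduce.input.fileinputformat.input.dir.recursive=true;",
    "set recursiveFileLookup=true;",
]
converted, skipped = to_spark_defaults(statements)
print("\n".join(converted))
print("keys without the spark. prefix:", skipped)
```

Keys reported as skipped would need another route, for example a `.config(...)` call on the session builder as in the answer above. Whether any of these settings changes how Spark 2.4.6 reads this particular table is not something the sketch verifies.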
