Apache Spark not using partition information from Hive partitioned external table
I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run a Hive query, it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, only the Spark plan is not showing the partition filters. However, when you look at the data loaded, it only loads the data needed from the partitions.
Example: with WHERE rating=0 it loads only one file of 1 MB; when there is no filter, it reads all 3 partitions for 3 MB.
tl;dr: set the following before running SQL against the external table:
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of the external/managed distinction.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)
If the table was created using the Spark APIs, it is considered a datasource table. If the table was created using HiveQL, it is considered a Hive native table. The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of the table's TBLPROPERTIES (describe extended <tblName>). The value of the property is orc (or empty) for a Spark table, and hive for a Hive table.
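To illustrate, the provider can be checked straight from the metastore metadata. A hypothetical session (the table name `sales` is an assumption, not from the original post) might look like:

```sql
-- Inspect how the table was recorded in the Hive metastore.
-- (table name is hypothetical)
DESCRIBE EXTENDED sales;
-- For a HiveQL-created table, the detailed output contains a line like:
--   Provider    hive
-- For a Spark-created ORC datasource table, it contains instead:
--   Provider    orc
```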
When the provider is not hive (datasource table), Spark uses its native way of processing the data. If the provider is hive, Spark uses Hive code to process the data.
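One way to see which path Spark takes is to compare query plans. A sketch for the spark-shell (requires a running SparkSession with Hive support; the table name `sales` and partition column `rating` are assumptions based on the question's example):

```scala
// When Spark goes through the Hive serde path, the physical plan shows a
// HiveTableRelation with no PartitionFilters entry. Through the datasource
// path, it shows a FileScan with a line such as:
//   PartitionFilters: [isnotnull(rating#0), (rating#0 = 0)]
spark.sql("SELECT * FROM sales WHERE rating = 0").explain(true)
```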
Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
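These flags can be set per session with spark.sql("set ...") as in the tl;dr above, or once for every job via spark-defaults.conf. A config-file sketch (the path is the conventional Spark location, assumed here):

```
# $SPARK_HOME/conf/spark-defaults.conf
# Use Spark's built-in readers (and partition pruning in the plan)
# for Hive-created ORC and Parquet tables.
spark.sql.hive.convertMetastoreOrc      true
spark.sql.hive.convertMetastoreParquet  true
```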
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?