
Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run the Hive query, it shows all records and partitions.

However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.

Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.

(screenshots)

Update:

For some reason, it is only the Spark plan that is not showing the partition filters. When you look at the data actually loaded, it is only reading the data needed from the relevant partitions.

Example: with WHERE rating=0 it loads only one 1 MB file; without the filter it reads all 3 partitions, about 3 MB.

(screenshots)

tl;dr: set the following before running the SQL against the external table: spark.sql("set spark.sql.hive.convertMetastoreOrc=true")

The difference in behaviour is not because the table is external or managed.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)

Where the table was created (Hive or Spark)

If the table was created using Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of tables is stored in the Hive metastore; the only difference is in the provider field of the tables' TBLPROPERTIES ( describe extended <tblName> ). The value of the property is orc (or empty) for a Spark table and hive for a Hive table.
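
As a quick check, you can look at the provider from Spark itself. A minimal sketch (the table name sales is hypothetical):

    // Show the Provider row from the extended description of the table
    spark.sql("DESCRIBE EXTENDED sales")
      .filter("col_name = 'Provider'")
      .show(truncate = false)

    // Or inspect the table properties directly
    spark.sql("SHOW TBLPROPERTIES sales").show(truncate = false)

A table created through HiveQL should show hive here, while a table written with Spark's DataFrame APIs should show orc or parquet.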

How Spark uses this information

When provider is not hive (a datasource table), Spark uses its native way of processing the data.
If provider is hive , Spark uses the Hive code path to process the data. The effect is visible in the physical plan, as in the sketch below.
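
A minimal sketch of how to compare the two code paths, assuming a hypothetical table sales partitioned by rating:

    // With conversion enabled, the built-in ORC reader is used and the plan
    // should contain a FileScan node with a PartitionFilters entry.
    spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
    spark.sql("select * from sales where rating = 0").explain(true)

    // With conversion disabled, the same query goes through HiveTableScan
    // and no PartitionFilters line appears in the plan output.
    spark.sql("set spark.sql.hive.convertMetastoreOrc=false")
    spark.sql("select * from sales where rating = 0").explain(true)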

File format

Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet. The flags are:

Orc

  val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
    .doc("When set to true, the built-in ORC reader and writer are used to process " +
      "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)

Parquet

  val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
    .doc("When set to true, the built-in Parquet reader and writer are used to process " +
      "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)
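
If you prefer, both flags can also be set when the session is built rather than via spark.sql("set ..."). A minimal sketch (the app name is just a placeholder); the values shown match the defaults quoted above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-pruning-check")   // placeholder name
      .enableHiveSupport()
      // Use Spark's built-in readers for Hive-created ORC/Parquet tables
      .config("spark.sql.hive.convertMetastoreOrc", "true")
      .config("spark.sql.hive.convertMetastoreParquet", "true")
      .getOrCreate()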

I also ran into this kind of problem, with multiple joins of internal and external tables.

None of the tricks work, including:

    spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
    spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
    spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")

Does anyone know how to solve this problem?
