
Apache Spark not using partition information from Hive partitioned external table

I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run a Hive query it shows all records and partitions.

However, when I use the same table in Spark (where the Spark SQL query has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.

Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
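For context, a minimal sketch of the scenario (the table name, columns, and S3 path below are hypothetical, not from the original post):

    // The external table is assumed to have been created in Hive, e.g.:
    //   CREATE EXTERNAL TABLE ratings_ext (user_id INT, movie_id INT)
    //   PARTITIONED BY (rating INT)
    //   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    //   LOCATION 's3://my-bucket/ratings/';
    //
    // In Spark, query the table with a filter on the partition column and
    // inspect the physical plan for a PartitionFilters entry.
    val df = spark.sql("SELECT * FROM ratings_ext WHERE rating = 0")
    df.explain(true)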

(screenshots omitted)

Update:

For some reason, only the Spark plan does not show the partition filters. However, when you look at the data that is actually loaded, only the data needed from the matching partitions is read.

Example: with WHERE rating=0, it loads only one 1 MB file; without the filter it reads all 3 partitions, about 3 MB.
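One way to double-check this (a sketch, reusing the hypothetical table from above) is to run the query with and without the filter and compare the input metrics reported for the resulting jobs in the Spark UI:

    // With partition pruning, only the files of the matching partition should
    // be read (~1 MB here); without the filter all partitions are read (~3 MB).
    spark.sql("SELECT COUNT(*) FROM ratings_ext WHERE rating = 0").show()
    spark.sql("SELECT COUNT(*) FROM ratings_ext").show()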

(screenshots omitted)

tl;dr: set the following before running the SQL against the external table: spark.sql("set spark.sql.hive.convertMetastoreOrc=true")

The difference in behaviour is not because of the external/managed table distinction.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)

Where the table was created (Hive or Spark)

If the table was created using the Spark APIs, it is considered a datasource table.
If the table was created using HiveQL, it is considered a Hive native table.
The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of the table's TBLPROPERTIES (describe extended <tblName>). The value of the property is orc (or empty) for a Spark table and hive for a Hive table.

How Spark uses this information

When provider is not hive (a datasource table), Spark uses its native way of processing the data.
If provider is hive, Spark uses Hive code to process the data.
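To check which case a given table falls into, the table metadata can be inspected from Spark (the table name is the hypothetical one from above):

    // For a HiveQL-created table the provider is reported as "hive";
    // for a Spark-created datasource table it is the format, e.g. "orc".
    spark.sql("DESCRIBE EXTENDED ratings_ext").show(100, false)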

File format

Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet. Flags:

Orc

  val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
    .doc("When set to true, the built-in ORC reader and writer are used to process " +
      "ORC tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)

Parquet

  val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
    .doc("When set to true, the built-in Parquet reader and writer are used to process " +
      "parquet tables created by using the HiveQL syntax, instead of Hive serde.")
    .booleanConf
    .createWithDefault(true)
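As the definitions above show, both flags default to true; a sketch of setting them explicitly for the current session before querying the Hive-created table:

    // Use Spark's built-in (datasource) ORC/Parquet readers for tables created
    // with HiveQL, so the native partition-pruning code path is used.
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

    // Equivalent SQL form, as in the tl;dr above:
    spark.sql("set spark.sql.hive.convertMetastoreOrc=true")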

I also ran into this kind of problem, with multiple joins of internal and external tables.

None of the tricks worked, including:

    spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
    spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
    spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")

Does anyone know how to solve this problem?

