Apache Spark not using partition information from Hive partitioned external table
I have a simple Hive external table created on top of S3 (the files are in CSV format). When I run a Hive query, it shows all records and partitions.
However, when I use the same table in Spark (where the Spark SQL has a WHERE condition on the partition column), the plan does not show that a partition filter is applied. For a Hive managed table, on the other hand, Spark is able to use the partition information and apply the partition filter.
Is there any flag or setting that can help me make use of the partitions of Hive external tables in Spark? Thanks.
Update:
For some reason, only the Spark plan is not showing the partition filters. However, when you look at the data loaded, it only loads the data needed from the partitions.
Example: with WHERE rating=0 it loads only one file of 1 MB; when there is no filter, it reads all 3 partitions for 3 MB.
tl;dr: set the following before running SQL against the external table:
spark.sql("set spark.sql.hive.convertMetastoreOrc=true")
The difference in behaviour is not because of the external/managed distinction.
The behaviour depends on two factors:
1. Where the table was created (Hive or Spark)
2. The file format (I believe it is ORC in this case, from the screen capture)
If the table was created using the Spark APIs, it is considered a datasource table. If the table was created using HiveQL, it is considered a Hive native table. The metadata of both kinds of table is stored in the Hive metastore; the only difference is in the provider field of the table's TBLPROPERTIES (describe extended <tblName>). The value of the property is orc (or empty) for a Spark table, and hive for a Hive table.
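To illustrate, the provider can be checked straight from the metastore metadata. A hypothetical session (the table name `sales` is an assumption, not from the original post) might look like:

```sql
-- Inspect how the table was recorded in the Hive metastore.
-- (table name is hypothetical)
DESCRIBE EXTENDED sales;
-- For a HiveQL-created table, the detailed output contains a line like:
--   Provider    hive
-- For a Spark-created ORC datasource table, it contains instead:
--   Provider    orc
```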
When the provider is not hive (datasource table), Spark uses its native way of processing the data. If the provider is hive, Spark uses Hive code to process the data.
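One way to see which path Spark takes is to compare query plans. A sketch for the spark-shell (requires a running SparkSession with Hive support; the table name `sales` and partition column `rating` are assumptions based on the question's example):

```scala
// When Spark goes through the Hive serde path, the physical plan shows a
// HiveTableRelation with no PartitionFilters entry. Through the datasource
// path, it shows a FileScan with a line such as:
//   PartitionFilters: [isnotnull(rating#0), (rating#0 = 0)]
spark.sql("SELECT * FROM sales WHERE rating = 0").explain(true)
```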
Spark provides config flags to instruct the engine to use the datasource way of processing the data for the following file formats: ORC and Parquet.
Flags:
Orc
val CONVERT_METASTORE_ORC = buildConf("spark.sql.hive.convertMetastoreOrc")
.doc("When set to true, the built-in ORC reader and writer are used to process " +
"ORC tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
Parquet
val CONVERT_METASTORE_PARQUET = buildConf("spark.sql.hive.convertMetastoreParquet")
.doc("When set to true, the built-in Parquet reader and writer are used to process " +
"parquet tables created by using the HiveQL syntax, instead of Hive serde.")
.booleanConf
.createWithDefault(true)
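These flags can be set per session with spark.sql("set ...") as in the tl;dr above, or once for every job via spark-defaults.conf. A config-file sketch (the path is the conventional Spark location, assumed here):

```
# $SPARK_HOME/conf/spark-defaults.conf
# Use Spark's built-in readers (and partition pruning in the plan)
# for Hive-created ORC and Parquet tables.
spark.sql.hive.convertMetastoreOrc      true
spark.sql.hive.convertMetastoreParquet  true
```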
I also ran into this kind of problem with multiple joins of internal and external tables.
None of the tricks worked, including:
spark.sql("set spark.sql.hive.convertMetastoreParquet=false")
spark.sql("set spark.sql.hive.metastorePartitionPruning=true")
spark.sql("set spark.sql.hive.caseSensitiveInferenceMode=NEVER_INFER")
Does anyone know how to solve this problem?