
How to access the Hive ACID table in Spark SQL?

We have worked on and open sourced a datasource that will enable users to work on their Hive ACID transactional tables using Spark.

Github: https://github.com/qubole/spark-acid

It is available as a Spark package, and instructions for using it are on the Github page. Currently the datasource supports only reading from Hive ACID tables; we are working on adding the ability to write into these tables via Spark as well.
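As a rough illustration of what reading through such a package looks like, here is a minimal Scala sketch. The package coordinates, the datasource name `HiveAcid`, and the table name `default.acidtbl` are assumptions taken as placeholders; check the Github page for the current coordinates and options, and note this only runs against a live Spark cluster with a Hive metastore.

```scala
// Launch the shell with the package on the classpath (version is a placeholder):
//   spark-shell --packages qubole:spark-acid:0.4.0-s_2.11

// Read a Hive ACID transactional table through the external datasource.
// "HiveAcid" and "default.acidtbl" are assumed names for illustration.
val df = spark.read
  .format("HiveAcid")                  // datasource registered by the package
  .option("table", "default.acidtbl")  // fully qualified Hive table name
  .load()

df.show()
```

Because the datasource implements the ACID-aware read logic itself, the resulting DataFrame can be used like any other Spark DataFrame (joins, aggregations, etc.).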

Feedback and suggestions are welcome!

@aniket Spark doesn't support reading Hive ACID tables directly (see https://issues.apache.org/jira/browse/SPARK-15348 and https://issues.apache.org/jira/browse/SPARK-16996 ). The data layout for transactional tables requires special logic to decide which directories to read and how to combine them correctly. Some data files may represent updates of previously written rows, for example. Also, if you read while something is writing to this table, your read may fail (without the special logic) because it will try to read incomplete ORC files. Compaction may (again, without the special logic) make it look like your data is duplicated. It can be done (WIP) via LLAP - tracked in https://issues.apache.org/jira/browse/HIVE-12991

I faced the same issue (Spark with Hive ACID tables) and was able to manage with a JDBC call from Spark. I can use this JDBC call from Spark until we get native ACID support in Spark.

https://github.com/Gowthamsb12/Spark/blob/master/Spark_ACID
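For readers who want to try the JDBC workaround, here is a hedged sketch of what such a read might look like. The host, port, database, table, and credentials are all placeholders for your environment, and the Hive JDBC driver (`org.apache.hive.jdbc.HiveDriver`) must be on Spark's classpath; this is not the linked author's exact code.

```scala
// Read a Hive ACID table via HiveServer2's JDBC endpoint instead of
// reading the ORC files directly. HiveServer2 applies the ACID merge
// logic server-side, so Spark just sees plain result rows.
// All connection values below are placeholders.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:hive2://hiveserver2-host:10000/default")
  .option("driver", "org.apache.hive.jdbc.HiveDriver")
  .option("dbtable", "acid_table")   // table to pull through JDBC
  .option("user", "hive")
  .option("password", "")
  .load()

jdbcDF.show()
```

The trade-off is that the whole query is executed by HiveServer2 and streamed through a single JDBC connection, so this tends to be much slower than a native parallel read for large tables.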

Spark can read ACID tables directly at least since Spark 2.3.2. But I can also confirm it can't read ACID tables in Spark 2.2.0.
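A quick way to check what your particular Spark build does is to issue a plain SQL read against a transactional table (the table name below is a placeholder):

```scala
// On a build without ACID-aware read logic this either fails outright
// or returns incorrect rows (missing updates, duplicates after
// compaction); on a build that handles ACID layouts it returns the
// expected count. "default.acid_table" is a placeholder name.
spark.sql("SELECT COUNT(*) FROM default.acid_table").show()
```

Comparing the count against `SELECT COUNT(*)` run in Hive itself is a simple sanity check, since a direct file-level read can silently return wrong results rather than an error.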
