Apache Hudi on Dataproc

Is there any guide to deploying Apache Hudi on a Dataproc cluster? I'm trying to deploy via the Hudi Quick Start Guide, but I can't get it to work.

Spark 3.1.1

Python 3.8.13

Debian 5.10.127 x86_64

Launch code:

pyspark --jars gs://bucket/artifacts/hudi-spark3.1.x_2.12-0.11.1.jar,gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog'

Try:

dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'JavaPackage' object is not callable

Edit 1:

pyspark --jars gs://bucket/artifacts/hudi-spark3.1.x_2.12-0.11.1.jar,gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

This throws a conf error:

WARN org.apache.spark.sql.SparkSession: Cannot use org.apache.spark.sql.hudi.HoodieSparkSessionExtension to configure session extensions.
java.lang.ClassNotFoundException: org.apache.spark.sql.hudi.HoodieSparkSessionExtension

I also get the same error when trying sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator().

Edit 2:

I was using the wrong jar: hudi-spark3.1.x_2.12-0.11.1.jar is not the full Spark bundle, so classes such as org.apache.spark.sql.hudi.HoodieSparkSessionExtension were never on the classpath (hence the ClassNotFoundException and the 'JavaPackage' error). The call below corrects the first problem; a quick way to confirm which failure mode you are in is sketched right after this paragraph.
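
As a sanity check (my own sketch, not from the guide), you can ask py4j what it resolved the name to before calling it:

from py4j.java_gateway import JavaClass, JavaPackage

# JavaClass means the Hudi jar is on the driver classpath; JavaPackage means
# it never loaded, which is exactly why calling it raises
# "TypeError: 'JavaPackage' object is not callable".
qsu = sc._jvm.org.apache.hudi.QuickstartUtils
print(isinstance(qsu, JavaClass), isinstance(qsu, JavaPackage))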

The correct pyspark call:

pyspark --jars gs://dev-dama-stg-spark/artifacts/hudi-spark3.1-bundle_2.12-0.12.1.jar,gs://dev-dama-stg-spark/artifacts/spark-avro_2.12-3.1.3.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'

However, there are new errors after creating the table and the Hudi options:

22/12/01 22:26:04 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Cannot find HUDI_CONF_DIR, please set it as the dir of hudi-defaults.conf
22/12/01 22:26:04 WARN org.apache.hudi.common.config.DFSPropertiesConfiguration: Properties file file:/etc/hudi/conf/hudi-defaults.conf not found. Ignoring to load props file
22/12/01 22:26:05 WARN org.apache.hudi.metadata.HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/hudi_trips_cow/.hoodie/metadata
22/12/01 22:26:07 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 2) (... 2): java.io.FileNotFoundException: File file:/tmp/hudi_trips_cow does not exist

Any clues...?

Found the solution myself.

First, to launch pyspark correctly, include the hudi-spark-bundle and spark-avro jars. In my case I also want to include some JDBC jars to connect to my on-premise services:

pyspark --jars gs://bucket/artifacts/hudi-spark3.1-bundle_2.12-0.12.1.jar,\
gs://bucket/artifacts/spark-avro_2.12-3.1.3.jar,\
gs://bucket/artifacts/mssql-jdbc-11.2.1.jre8.jar,\
gs://bucket/artifacts/ngdbc-2.12.9.jar \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
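
Alternatively, if the cluster nodes can reach Maven Central, the quick start's --packages form should pull the same artifacts without staging jars in a bucket (untested on my Dataproc setup, so treat it as a sketch):

pyspark --packages org.apache.hudi:hudi-spark3.1-bundle_2.12:0.12.1,org.apache.spark:spark-avro_2.12:3.1.3 \
--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
--conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'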

Follow the Hudi quick start guide; the only thing to change is the basePath. The file:///tmp path is local to each node, so executors cannot see the table files the driver wrote (which is what the FileNotFoundException above was complaining about). Change this:

basePath = "file:///tmp/hudi_trips_cow"

to this:

basePath = "gs://bucket/tmp/hudi_trips_cow"

With this configuration I was able to run Hudi correctly on Dataproc.

If I find new information I will post it here, to keep this as a short guide.
