
Pyspark streaming from Kafka to Hudi

I'm new to Hudi and I have a problem. I'm working on an AWS EMR cluster with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark Structured Streaming and then write it to S3 in Hudi format. To be honest, I've been trying for a few weeks now and I don't know whether it is even possible. Can someone help me, please? The code I'm working with is:

    #Reading
    df_T = spark.readStream \
        .format("kafka") \
        .options(**options_read) \
        .option("subscribe", topic) \
        .load() 

....
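The elided code has to turn the binary Kafka `value` column into typed columns before the Hudi writer can pick out the record key. A minimal sketch of that step, assuming JSON-encoded messages; the schema and column names here are placeholders, not my real ones:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Hypothetical schema for the JSON payload; the real topic schema goes here.
    payload_schema = StructType([
        StructField("my_key", StringType()),        # record key column
        StructField("my_partition", StringType()),  # partition path column
        StructField("my_ts", TimestampType()),      # precombine column
    ])

    # Kafka delivers 'value' as binary, so cast it to a string and parse the JSON
    # into real columns before handing the stream to the Hudi writer.
    df_T = df_T \
        .select(F.from_json(F.col("value").cast("string"), payload_schema).alias("d")) \
        .select("d.*")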

    hudi_options = {
        # Table identity
        'hoodie.table.name': MyTable,
        'hoodie.datasource.write.table.name': MyTable,

        # Record key, partition path, precombine (deduplication) field and key generator
        'hoodie.datasource.write.recordkey.field': MyKeyInTable,
        'hoodie.datasource.write.partitionpath.field': MyPartitionKey,
        'hoodie.datasource.write.precombine.field': MyTimeStamp,
        'hoodie.datasource.write.keygenerator.class': "org.apache.hudi.keygen.ComplexKeyGenerator",

        # Write behaviour
        'hoodie.datasource.write.hive_style_partitioning': "true",
        'hoodie.datasource.write.row.writer.enable': "false",
        'hoodie.datasource.write.operation': 'bulk_insert',
        'hoodie.datasource.write.storage.type': 'MERGE_ON_READ',
        'hoodie.insert.shuffle.parallelism': 1,
        'hoodie.consistency.check.enabled': "true",
        'hoodie.cleaner.policy': "KEEP_LATEST_COMMITS",
        'hoodie.compact.inline': "false",

        # Hive sync
        'hoodie.datasource.hive_sync.enable': 'true',
        'hoodie.datasource.hive_sync.table': MyTable,
        'hoodie.datasource.hive_sync.database': Mydatabase,
        'hoodie.datasource.hive_sync.partition_fields': MyPartitionKey,
        'hoodie.datasource.hive_sync.auto_create_database': "true",
        'hoodie.datasource.hive_sync.partition_extractor_class': "org.apache.hudi.hive.MultiPartKeysValueExtractor",
        'hoodie.datasource.hive_sync.skip_ro_suffix': 'true'
    }

....

    ds = df_T \
        .writeStream \
        .outputMode('append') \
        .format("org.apache.hudi") \
        .options(**hudi_options)\
        .option('checkpointLocation', MyCheckpointLocation) \
        .start(MyPathLocation) \
        .awaitTermination(300)

....

On the EMR this code appears to run fine, but when I go looking for the Hudi files, none have been created. I know the Kafka configuration works, because when I point the stream at the 'console' sink instead, everything shows up fine. Can someone help me?
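For debugging, note that .start() returns a StreamingQuery while awaitTermination(300) returns a boolean, so the ds variable above is not the query handle. Keeping the handle makes it possible to inspect what the stream is doing; a sketch:

    # Keep the StreamingQuery handle instead of the boolean from awaitTermination().
    query = df_T.writeStream \
        .outputMode('append') \
        .format("org.apache.hudi") \
        .options(**hudi_options) \
        .option('checkpointLocation', MyCheckpointLocation) \
        .start(MyPathLocation)

    query.awaitTermination(300)   # wait up to 300 seconds
    print(query.status)           # e.g. whether a trigger is active or waiting for data
    print(query.lastProgress)     # rows processed in the last micro-batch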

Hello guys, I was able to fix this error. First of all, you have to clean the dataframe: not everything, but at least drop every row in which one of the table's primary-key fields is null (a sketch of this step follows below). As a second point, for hoodie.datasource.write.precombine.field you can use a freshly generated load timestamp, as in the snippet further down.
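For the first point, a minimal sketch of the cleaning step, assuming MyKeyInTable and MyPartitionKey hold the key column names as in the question:

    # Rows with a null record key or partition path break the Hudi write,
    # so drop them before starting the stream.
    df_T = df_T.na.drop(subset=[MyKeyInTable, MyPartitionKey])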

...

import datetime

currentDate = datetime.datetime.now() 

For example:

    hudi_options = {
    ...
        # The precombine field must reference a column in the dataframe, so add a
        # Loaded_Date column (see below) and point the option at that column name:
        'hoodie.datasource.write.precombine.field': 'Loaded_Date',
    ...
    }

Finally, if you don't have a timestamp in your dataframe, you can add one like this:

from pyspark.sql import functions as F
df_T = df_T.withColumn('Loaded_Date', F.lit(currentDate).cast('timestamp'))
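With those two changes in place, the write itself stays exactly as in the question; a sketch of the final call, for completeness (same placeholder names as above):

    # df_T now has no null keys and carries the Loaded_Date precombine column,
    # so the stream produces Hudi files under MyPathLocation.
    df_T.writeStream \
        .outputMode('append') \
        .format("org.apache.hudi") \
        .options(**hudi_options) \
        .option('checkpointLocation', MyCheckpointLocation) \
        .start(MyPathLocation) \
        .awaitTermination(300)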
