
Hive data to Kafka topic using Spark

I am trying to write data from a Hive table to a Kafka topic using Spark.

I am writing a data frame of about 9 million records per day to a Kafka topic with the following query:

val ds = df.selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .start()

Is this query capable of writing that volume of data to the Kafka topic?

If yes, roughly how long would it take to finish writing the data?

If not, what other ways are there to do it?

You can use batch processing if the task is to perform the above operation daily rather than in real time.

9 million records can be handled easily this way.

The time required depends on the cluster configuration and on any intermediate processing that is needed.
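As a rough illustration, a minimal sketch of such a daily batch job might look like the following. The table name my_db.my_table, the columns id and payload, the topic my_topic, and the broker list are placeholders, not anything from the original question. Note that a batch write ends in save(); start() applies only to streaming queries started via writeStream.

import org.apache.spark.sql.SparkSession

object HiveToKafkaBatch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-kafka-batch")
      .enableHiveSupport()   // needed to read Hive tables
      .getOrCreate()

    // Read the Hive table (placeholder name) and shape it into the
    // key/value string schema that the Kafka sink expects.
    val df = spark.table("my_db.my_table")
      .selectExpr("CAST(id AS STRING) AS key", "CAST(payload AS STRING) AS value")

    // Batch write to Kafka: format("kafka") with save(), not start().
    df.write
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
      .option("topic", "my_topic")   // placeholder topic
      .save()

    spark.stop()
  }
}

A job like this can then be submitted once a day with spark-submit from whatever scheduler you already use (cron, Oozie, Airflow, etc.).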
