
Scaling with Apache Spark/Apache Flink

I'm planning an application that reads from Apache Kafka and, after (potentially time-consuming) processing, saves data to a database.

My case involves discrete messages, not streams, but for scalability I'm thinking about plugging this into Spark or Flink. I can't grasp how these scale: should my app, when running as part of Spark/Flink, read some data from Kafka and then exit, or keep reading continuously?

How will Spark/Flink then decide that they must spawn more instances of my app to improve throughput?

Thanks!

In Apache Flink you can define the parallelism of operations by calling env.setParallelism(#parallelism), which makes all operators run with #parallelism parallel instances, or you can define/override it per operator, e.g. dataStream.map(...).setParallelism(#parallelism).
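
A minimal sketch of both styles (Flink Java DataStream API; the parallelism values and the map function are placeholder assumptions, not from the question):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ParallelismExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Default parallelism for every operator in this job
            env.setParallelism(4);

            env.fromElements("a", "b", "c")
               // Per-operator override: this map runs with 8 parallel instances
               .map(String::toUpperCase).setParallelism(8)
               .print();

            env.execute("parallelism-example");
        }
    }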

For more info, check the Flink docs: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/parallel.html

Regarding reading from Kafka, you can define parallel receivers (in the same consumer group) to scale up/down with the Kafka topic partitions: env.addSource(kafkaConsumer).setParallelism(#topicPartitions). A sketch is shown below.
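
A minimal sketch, assuming Flink 1.3 with the flink-connector-kafka-0.10 dependency and a hypothetical topic "events" with 4 partitions (broker address, group id, and topic name are placeholders):

    import java.util.Properties;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

    public class KafkaScalingExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "my-consumer-group");

            FlinkKafkaConsumer010<String> kafkaConsumer =
                    new FlinkKafkaConsumer010<>("events", new SimpleStringSchema(), props);

            // Match source parallelism to the number of topic partitions, so each
            // parallel source instance reads one partition; instances beyond the
            // partition count would sit idle
            DataStream<String> stream = env.addSource(kafkaConsumer).setParallelism(4);

            stream.print();
            env.execute("kafka-scaling-example");
        }
    }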

Check the Kafka documentation for more info about topics, partitions, and consumer groups: https://kafka.apache.org/documentation/

Note that if you don't specify the parallelism level inside the Flink program and you deploy it on a local Flink cluster, the value of the parallelism.default parameter in the config file flinkDir/conf/flink-conf.yaml will be used, unless you specify it with -p, like ./bin/flink run .... -p #parallelism. Check the Flink CLI options.
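
For illustration, the fallback chain could look like this (the directory, jar name, and values are hypothetical):

    # flinkDir/conf/flink-conf.yaml: cluster-wide default, used only
    # when neither the program nor the CLI sets a parallelism
    parallelism.default: 2

    # -p at job submission overrides parallelism.default for this job
    ./bin/flink run -p 8 path/to/myJob.jar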
