
Use RichMap in Flink like Scala MapPartition

2023-01-30 08:58:42 | Tags: apache-spark, apache-flink, flink-streaming

In Spark, we have the MapPartition function, which lets us do some initialization once for a group of entries, such as a DB operation.

Now I want to do the same thing in Flink. After some research I found that I can use RichMap for this, but it has a drawback: the initialization can only be done in the open method, which runs once at the start of a streaming job. I will explain my use case to clarify the situation.

Example: I am receiving data for millions of users from Kafka, but I only want the data of some users to be persisted in the end. This list of users is dynamic and is stored in a DB. I want to look up the current users every 10 minutes, so that I filter and store the data for only those users. In Spark (MapPartition), the user lookup would run for every group, and there I had configured it to fetch the users from the DB every 10 minutes. But in Flink with RichMap I can do that only in the open function, when my job starts.

How can I do this in Flink?

1 Answer

It seems that what you want is a stream-table join. There are multiple ways of doing that, but the easiest one here is probably the Broadcast State pattern.

The idea is to define a custom data source that periodically queries the data from the SQL table (or, even better, use CDC), use that table stream as broadcast state, and connect it with the actual users stream.

Inside the ProcessFunction for the connected streams you will have access to the broadcast table data, so you can perform a lookup for every user event you receive and decide what to do with it.
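
To make the lookup logic concrete, here is a minimal, dependency-free Java sketch of what the two callbacks of such a function would do. It is not Flink code: the class AllowListFilter and its methods onAllowListSnapshot and onUserEvent are hypothetical names made up for illustration. In a real job this logic would presumably live in a BroadcastProcessFunction, with onAllowListSnapshot corresponding to processBroadcastElement and onUserEvent to processElement, and the set below would be replaced by broadcast state declared through a MapStateDescriptor. The sketch also assumes each broadcast record is a full snapshot of the allowed users, which is only one possible design (a CDC feed would deliver individual inserts and deletes instead).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Plain-Java model of the broadcast state pattern (hypothetical names,
 * no Flink dependency). One instance stands in for one parallel task.
 */
class AllowListFilter {
    // Stands in for Flink's broadcast state: the users whose data we keep.
    private final Set<String> allowedUsers = new HashSet<>();
    // Stands in for the downstream sink that persists the data.
    private final List<String> persisted = new ArrayList<>();

    /** Broadcast side: a fresh snapshot of the user table replaces the old one. */
    void onAllowListSnapshot(Set<String> snapshot) {
        allowedUsers.clear();
        allowedUsers.addAll(snapshot);
    }

    /** Data side: keep the event only if its user is currently allowed. */
    void onUserEvent(String userId, String payload) {
        if (allowedUsers.contains(userId)) {
            persisted.add(userId + ":" + payload);
        }
    }

    /** Everything that would have been written to the sink so far. */
    List<String> persisted() {
        return persisted;
    }
}
```

With this model, an event that arrives before the first snapshot is dropped, and a later snapshot changes which users pass the filter without restarting anything. That is the behavior the question is asking for. In an actual Flink job you would also have to decide what to do with events that arrive before the first broadcast record (drop them, as here, or buffer them).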