Finding missing records from 2 data sources using Flink

Question

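I have two data sources: an S3 bucket and a Postgres database table. Both sources hold records in the same format, each with a unique identifier of type uuid. Some of the records present in the S3 bucket are not part of the Postgres table, and the goal is to find those missing records. The data is bounded, as it is partitioned by day in the S3 bucket.

Reading the s3-source (I believe this reads the data in batch mode, since I am not providing the monitorContinuously() argument) -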


    final FileSource<GenericRecord> source = FileSource.forRecordStreamFormat(
                                             AvroParquetReaders.forGenericRecord(schema), path).build();
    
    final DataStream<GenericRecord> avroStream = env.fromSource(
                                                 source, WatermarkStrategy.noWatermarks(), "s3-source");
    
    DataStream<Row> s3Stream = avroStream.map(x -> Row.of(x.get("uuid").toString()))
                                      .returns(Types.ROW_NAMED(new String[] {"uuid"}, Types.STRING));
    
    Table s3table = tableEnv.fromDataStream(s3Stream); 
    tableEnv.createTemporaryView("s3table", s3table);

To read from Postgres, I created a Postgres catalog -

    PostgresCatalog postgresCatalog = (PostgresCatalog) JdbcCatalogUtils.createCatalog(
            catalogName,
            defaultDatabase,
            username,
            pwd,
            baseUrl);
    
    tableEnv.registerCatalog(postgresCatalog.getName(), postgresCatalog);
    tableEnv.useCatalog(postgresCatalog.getName());
    
    Table dbtable = tableEnv.sqlQuery("select cast(uuid as varchar) from `localschema.table`");
    tableEnv.createTemporaryView("dbtable", dbtable);

My intent was to simply perform a left join and find the records missing from dbtable. Something like this -

    Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
    DataStream<Row> resultStream = tableEnv.toDataStream(resultTable);
    resultStream.print();
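
For reference, the result this LEFT JOIN ... IS NULL query is meant to produce is an anti-join: every uuid on the S3 side that has no match on the Postgres side. The following is only a minimal plain-Java sketch of that semantics on in-memory lists, with no Flink involved. The class name MissingUuids, the method missingFromDb, and the ids u1..u3 are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Flink: shows what the anti-join should return.
public class MissingUuids {

    /** Returns the ids from s3Ids that have no match in dbIds, in S3 order. */
    public static List<String> missingFromDb(List<String> s3Ids, Collection<String> dbIds) {
        // Build side: every uuid that exists in the Postgres table.
        Set<String> known = new HashSet<>(dbIds);
        List<String> missing = new ArrayList<>();
        // Probe side: keep only the S3 ids without a match.
        for (String id : s3Ids) {
            if (!known.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up ids standing in for real uuids.
        List<String> s3Ids = Arrays.asList("u1", "u2", "u3");
        List<String> dbIds = Arrays.asList("u2");
        System.out.println(missingFromDb(s3Ids, dbIds)); // prints [u1, u3]
    }
}
```

This only works when both sides are fully known before the comparison starts, which is why the bounded (batch) reading of both sources matters for the Flink job.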

However, it seems the UUID column type is not supported yet, because I get the following exception.

Caused by: java.lang.UnsupportedOperationException: Doesn't support Postgres type 'uuid' yet
    at org.apache.flink.connector.jdbc.dialect.psql.PostgresTypeMapper.mapping(PostgresTypeMapper.java:171)

As an alternative, I tried reading the database table as follows -

    TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] {
            BasicTypeInfo.of(String.class)
    };
    RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
    JdbcInputFormat jdbcInputFormat = JdbcInputFormat.buildJdbcInputFormat()
                                              .setDrivername("org.postgresql.Driver")
                                              .setDBUrl("jdbc:postgresql://127.0.0.1:5432/localdatabase")
                                              .setQuery("select cast(uuid as varchar) from localschema.table")
                                              .setUsername("postgres")
                                              .setPassword("postgres")
                                              .setRowTypeInfo(rowTypeInfo)
                                              .finish();

    DataStream<Row> dbStream = env.createInput(jdbcInputFormat);

    Table dbtable = tableEnv.fromDataStream(dbStream).as("uuid");
    tableEnv.createTemporaryView("dbtable", dbtable);

Only this time, I get the following exception when performing the left join (as above) -

Exception in thread "main" org.apache.flink.table.api.TableException: Table sink '*anonymous_datastream_sink$3*' doesn't support consuming update and delete changes which is produced by node Join(joinType=[LeftOuterJoin]

If I tweak the resultStream to publish a changelog stream instead, it works -

    Table resultTable = tableEnv.sqlQuery("SELECT * FROM s3table LEFT JOIN dbtable ON s3table.uuid = dbtable.uuid where dbtable.uuid is null");
    DataStream<Row> resultStream = tableEnv.toChangelogStream(resultTable);
    resultStream.print();

Sample output

+I[9cc38226-bcce-47ce-befc-3576195a0933, null]
+I[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
+I[da6f57c8-3ad1-4df5-9636-c6b36df2695f, null]
+I[2f3845c1-6444-44b6-b1e8-c694eee63403, null]
-D[9cc38226-bcce-47ce-befc-3576195a0933, null]
-D[a24bf933-1bb7-425f-b1a7-588fb175fa11, null]
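
To read this output: each +I row is emitted when an S3 record has no match yet, and a later -D for the same row retracts it once a matching record shows up on the database side. The rows that are never retracted are the ones still missing. The following is a minimal plain-Java sketch of folding such a changelog into its final state. Kind and Change are hypothetical stand-ins written for this example (in Flink, the kind of a Row is available through Row.getKind()), and only the two kinds seen above are modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Flink: folds a changelog into its final rows.
public class ChangelogFold {

    /** Stand-in for the row kind; only +I and -D are modelled here. */
    public enum Kind { INSERT, DELETE }

    /** Stand-in for one changelog row: its kind and the uuid it carries. */
    public static final class Change {
        final Kind kind;
        final String uuid;

        public Change(Kind kind, String uuid) {
            this.kind = kind;
            this.uuid = uuid;
        }
    }

    /** Applies the changes in order and returns the uuids still present at the end. */
    public static List<String> fold(List<Change> changes) {
        // Net count per uuid: +1 for every insert, -1 for every delete.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Change c : changes) {
            int delta = (c.kind == Kind.INSERT) ? 1 : -1;
            counts.merge(c.uuid, delta, Integer::sum);
        }
        // A uuid survives only if its inserts outnumber its deletes.
        List<String> remaining = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 0) {
                remaining.add(e.getKey());
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        // The six changelog rows from the sample output above.
        List<Change> sample = Arrays.asList(
                new Change(Kind.INSERT, "9cc38226-bcce-47ce-befc-3576195a0933"),
                new Change(Kind.INSERT, "a24bf933-1bb7-425f-b1a7-588fb175fa11"),
                new Change(Kind.INSERT, "da6f57c8-3ad1-4df5-9636-c6b36df2695f"),
                new Change(Kind.INSERT, "2f3845c1-6444-44b6-b1e8-c694eee63403"),
                new Change(Kind.DELETE, "9cc38226-bcce-47ce-befc-3576195a0933"),
                new Change(Kind.DELETE, "a24bf933-1bb7-425f-b1a7-588fb175fa11"));
        // prints [da6f57c8-3ad1-4df5-9636-c6b36df2695f, 2f3845c1-6444-44b6-b1e8-c694eee63403]
        System.out.println(fold(sample));
    }
}
```

The fold is only meaningful once the whole changelog has been consumed, so it suits a finite input like the one here.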

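However, I do not want the sink to receive inserts and deletes as separate events. I only want the final list of missing uuids. I guess this happens because my Postgres source, created with DataStream<Row> dbStream = env.createInput(jdbcInputFormat);, is a streaming source. If I try to execute the whole application in BATCH mode, I get the following exception -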

org.apache.flink.table.api.ValidationException: Querying an unbounded table '*anonymous_datastream_source$2*' in batch mode is not allowed. The table source is unbounded.

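Is it possible to have a bounded JDBC source? If not, how can I achieve this using the streaming API? (Using Flink version 1.15.2.)

I believe this is a common use case that can be implemented with Flink, but clearly I am missing something. Any leads would be appreciated.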

Answer 1

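The common approach at the moment is to sink the resultStream into a table. You can schedule a job that truncates the table and then runs the Apache Flink job. Afterwards, read the results from that table.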

I also noticed that Apache Flink Table Store 0.3.0 has just been released, and its roadmap for 0.4.0 has concrete plans. That might be a solution too. Very exciting, IMHO.

Solution 1
0 2023-01-16 14:11:05
