PyFlink on Kinesis Analytics Studio - Cannot convert DataStream to Amazon Kinesis Data Stream

I have a DataStream <pyflink.datastream.data_stream.DataStream> coming out of a CoFlatMapFunction (simplified here):

%flink.pyflink
# join two streams and update the rule-set
class MyCoFlatMapFunction(CoFlatMapFunction):

    def open(self, runtime_context: RuntimeContext):
        state_desc = MapStateDescriptor('map', Types.STRING(), Types.BOOLEAN())
        self.state = runtime_context.get_map_state(state_desc)

    def bool_from_user_number(self, user_number: int):
        '''Returns True if user_number is greater than 0, False otherwise.'''
        if user_number > 0:
            return True
        else:
            return False

    def flat_map1(self, value):
        '''This method is called for each element in the first of the connected streams'''
        self.state.put(value[1], self.bool_from_user_number(value[2]))

    def flat_map2(self, value):
        '''This method is called for each element in the second of the connected streams (exchange_server_tickers_data_py)'''

        current_dateTime = datetime.now()
        dt = current_dateTime

        x = value[1]
        y = value[2]

        yield Row(dt, x, y)

def generate__ds(st_env):
    # interpret the updating Tables as DataStreams
    type_info1 = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.INT()])
    ds1 = st_env.to_append_stream(table_1, type_info=type_info1)

    type_info2 = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
    ds2 = st_env.to_append_stream(table_2, type_info=type_info2)

    output_type_info = Types.ROW([Types.PICKLED_BYTE_ARRAY(), Types.STRING(), Types.STRING()])
    # Connect the two streams
    connected_ds = ds1.connect(ds2)
    # Apply the CoFlatMapFunction
    ds = connected_ds.key_by(lambda a: a[0], lambda a: a[0]).flat_map(MyCoFlatMapFunction(), output_type_info)
    return ds

ds = generate__ds(st_env)

However, I cannot see the output, whether by registering it as a view/table, by writing to a sink table, or (ideally) by writing the data from the Flink stream to a Kinesis stream using a Kinesis Streams sink. Firehose also does not fit my use case, as the ~30 second latency is too long. Any help would be appreciated, thanks!

What I have tried:

Registering it as a view/table, like this:

# interpret the DataStream as a Table
input_table = st_env.from_data_stream(ds).alias("dt", "x", "y")
z.show(input_table, stream_type="update")

This gives the error:

Query schema: [dt: RAW('[B', '...'), x: STRING, y: STRING]
Sink schema:  [dt: RAW('[B', ?), x: STRING, y: STRING]

I have also tried writing to a sink table, like this:

%flink.pyflink
# create a sink table to emit results
st_env.execute_sql("""DROP TABLE IF EXISTS table_sink""")

st_env.execute_sql("""
    CREATE TABLE table_sink (
        dt RAW('[B', '...'),
        x VARCHAR(32),
        y STRING
    ) WITH (
        'connector' = 'print'
    )
""")

# convert the Table API table to a SQL view
table = st_env.from_data_stream(ds).alias("dt", "spread", "spread_orderbook")
st_env.execute_sql("""DROP TEMPORARY VIEW IF EXISTS table_api_table""")
st_env.create_temporary_view('table_api_table', table)

# emit the Table API table
st_env.execute_sql("INSERT INTO table_sink SELECT * FROM table_api_table").wait()

I get the error: org.apache.flink.table.api.ValidationException: Unable to restore the RAW type of class '[B' with serializer snapshot '...'.

I have also tried writing the data to a sink with sink_to, where the sink would be an AWS Kinesis Data Stream as in these docs, like this:

%flink.pyflink
from pyflink.common.serialization import JsonRowSerializationSchema, SimpleStringSchema
from pyflink.datastream.connectors import KinesisStreamsSink

output_type_info = Types.ROW([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
serialization_schema = JsonRowSerializationSchema.Builder().with_type_info(output_type_info).build()

# Required
sink_properties = {
    'aws.region': 'eu-west-2'
}

kds_sink = (
    KinesisStreamsSink.builder()
    .set_kinesis_client_properties(sink_properties)
    .set_serialization_schema(SimpleStringSchema())
    .set_partition_key_generator(PartitionKeyGenerator.fixed())
    .set_stream_name("test_stream")
    .set_fail_on_error(False)
    .set_max_batch_size(500)
    .set_max_in_flight_requests(50)
    .set_max_buffered_requests(10000)
    .set_max_batch_size_in_bytes(5 * 1024 * 1024)
    .set_max_time_in_buffer_ms(5000)
    .set_max_record_size_in_bytes(1 * 1024 * 1024)
    .build()
)

ds.sink_to(kds_sink)

I thought this would work, but KinesisStreamsSink cannot be found in pyflink.datastream.connectors, and I could not find any documentation on how to do this in AWS Kinesis Analytics Studio. Any help would be appreciated, thanks! How would I go about writing the data to a Kinesis Streams sink / converting it to a table?

OK, I figured it out. There are a few issues with the specific PyFlink version available on AWS Kinesis Analytics Studio (Flink 1.13). The error messages themselves were not very useful, so for anyone who runs into problems I would really recommend viewing the errors in the Flink web UI. Firstly, the MapStateDescriptor datatypes must be specified using Types.PICKLED_BYTE_ARRAY(). Secondly, not shown in the question, each MapStateDescriptor must have a distinct name. I also found that using Row from pyflink.common gave me errors; switching to tuples by specifying Types.TUPLE(), as is done in the corrected code below, worked better for me. I also had to switch to specifying the output as a tuple.
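
To isolate those changes from the full corrected code further down, here is a minimal sketch (the descriptor and variable names simply mirror that code):

from pyflink.common.typeinfo import Types
from pyflink.datastream.state import MapStateDescriptor

# each MapStateDescriptor gets its own distinct name, and the key/value types
# are declared as PICKLED_BYTE_ARRAY on this PyFlink 1.13 build
state_desc = MapStateDescriptor('map', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())
ob_state_desc = MapStateDescriptor('map_OB', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())

# the CoFlatMapFunction yields plain tuples instead of Row, and the output
# type is declared with Types.TUPLE() rather than Types.ROW()
output_type_info = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING()])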

Another thing I have not yet done is specify a watermark strategy for the DataStream. This can be done by extracting the timestamp from the first field and assigning watermarks based on knowledge of the stream:

class MyTimestampAssigner(TimestampAssigner):
    def extract_timestamp(self, value, record_timestamp: int) -> int:
        return int(value[0])

watermark_strategy = WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)).with_timestamp_assigner(MyTimestampAssigner())
ds = ds.assign_timestamps_and_watermarks(watermark_strategy)

# the first field has been used for timestamp extraction, and is no longer necessary
# replace first field with a logical event time attribute
table = st_env.from_data_stream(ds, col("dt").rowtime, col('f0'), col('f1'))

Instead, I created a sink table that writes back out to a Kinesis Data Stream as the output. Overall, the corrected code looks something like this:


from pyflink.table.expressions import col
from pyflink.datastream.state import MapStateDescriptor, ValueStateDescriptor
from pyflink.datastream.functions import RuntimeContext, CoFlatMapFunction, KeyedProcessFunction
from pyflink.common.typeinfo import Types
from pyflink.common import WatermarkStrategy, Duration
from pyflink.common.watermark_strategy import TimestampAssigner
from pyflink.datastream import StreamExecutionEnvironment
from datetime import datetime

# Register the tables in the env
table1 = st_env.from_path("sql_table_1")
table2 = st_env.from_path("sql_table_2")

# interpret the updating Tables as DataStreams
type_info1 = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING(), Types.INT()])
ds1 = st_env.to_append_stream(table2, type_info=type_info1)

type_info2 = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING(), Types.STRING()])
ds2 = st_env.to_append_stream(table1, type_info=type_info2)

# join two streams and update the rule-set state
class MyCoFlatMapFunction(CoFlatMapFunction):
    
    def open(self, runtime_context: RuntimeContext):
        '''This method is called when the function is opened in the runtime. It is used for initialization purposes.'''
        # Map state that we use to maintain the filtering and rules
        state_desc = MapStateDescriptor('map', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())
        self.state = runtime_context.get_map_state(state_desc)
        
        # maintain state 2
        ob_state_desc = MapStateDescriptor('map_OB', Types.PICKLED_BYTE_ARRAY(), Types.PICKLED_BYTE_ARRAY())
        self.ob_state = runtime_context.get_map_state(ob_state_desc)
                   
    # called on ds1
    def flat_map1(self, value):
        '''This method is called for each element in the first of the connected streams '''
        list_res = value[1].split('|')
        for i in list_res:
            time = datetime.utcnow().replace(microsecond=0)
            yield (time, f"{i}_one")

    # called on ds2
    def flat_map2(self, value):
        '''This method is called for each element in the second of the connected streams'''
        list_res = value[1].split('|')
        for i in list_res:
            time = datetime.utcnow().replace(microsecond=0)
            yield (time, f"{i}_two")
        
connectedStreams = ds1.connect(ds2)
output_type_info = Types.TUPLE([Types.SQL_TIMESTAMP(), Types.STRING()])
ds = connectedStreams.key_by(lambda value: value[1], lambda value: value[1]).flat_map(MyCoFlatMapFunction(), output_type=output_type_info)


name = 'output_table'
ds_table_name = 'temporary_table_dump'

st_env.execute_sql(f"""DROP TABLE IF EXISTS {name}""")

def create_table(table_name, stream_name, region, stream_initpos):
    return """ CREATE TABLE {0} (
                f0 TIMESTAMP(3),
                f1 STRING,
                WATERMARK FOR f0 AS f0 - INTERVAL '5' SECOND
              )
              WITH (
                'connector' = 'kinesis',
                'stream' = '{1}',
                'aws.region' = '{2}',
                'scan.stream.initpos' = '{3}',
                'sink.partitioner-field-delimiter' = ';',
                'sink.producer.collection-max-count' = '100',
                'format' = 'json',
                'json.timestamp-format.standard' = 'ISO-8601'
              ) """.format(
        table_name, stream_name, region, stream_initpos
    )

# Creates a sink table writing to a Kinesis Data Stream
st_env.execute_sql(create_table(name, 'output-test', 'eu-west-2', 'LATEST'))
table = st_env.from_data_stream(ds)
st_env.execute_sql(f"""DROP TEMPORARY VIEW IF EXISTS {ds_table_name}""")
st_env.create_temporary_view(ds_table_name, table)

# emit the Table API table 
st_env.execute_sql(f"INSERT INTO {name} SELECT * FROM {ds_table_name}").wait()
