Upsert data in PostgreSQL using Spark Structured Streaming
I am trying to run a Structured Streaming application using (py)spark. My data is read from a Kafka topic, and I am running a windowed aggregation on event time.
# I have been able to create data frame pn_data_df after reading data from Kafka
Schema of pn_data_df:
- id: StringType
- source: StringType
- source_id: StringType
- delivered_time: TimestampType
from pyspark.sql.functions import window, unix_timestamp

# 15-minute windowed counts per source_id, with a 24-hour watermark on event time
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
    .withWatermark("delivered_time", "24 hours") \
    .groupBy('source_id', window('delivered_time', '15 minute')) \
    .count()

windowed_report_df = windowed_report_df \
    .withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
    .withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
    .selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')
I am writing this windowed aggregation to a PostgreSQL table that I have already created:
CREATE TABLE pn_delivery_report(
    source_id bigint not null,
    start_ts bigint not null,
    end_ts bigint not null,
    count integer not null,
    unique(source_id, start_ts)
);
Writing to PostgreSQL using the Spark JDBC sink only lets me choose Append or Overwrite. Append mode fails if the composite key already exists in the database, and Overwrite simply replaces the entire table with the current batch output.
def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()
windowed_report_df.writeStream \
    .foreachBatch(write_pn_report_to_postgres) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()
How can I execute a query like the following in foreachBatch?

INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
VALUES (1001, 125000000001, 125000050000, 128),
       (1002, 125000000001, 125000050000, 127)
ON CONFLICT (source_id, start_ts) DO UPDATE
SET count = excluded.count;
Spark has a JIRA feature ticket open for this, but it does not seem to have been prioritised so far: https://issues.apache.org/jira/browse/SPARK-19335
This is what worked for me:
def _write_streaming(df, epoch_id) -> None:
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()

df_stream.writeStream \
    .foreachBatch(_write_streaming) \
    .start() \
    .awaitTermination()
You need to add ".awaitTermination()" at the end.
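Note that the snippet above still only appends, so it will hit the same unique-constraint problem from the question. Since the built-in JDBC sink only supports append or overwrite, one way to get real upsert semantics is to run the INSERT ... ON CONFLICT statement yourself inside the foreachBatch function. The following is only a minimal sketch, assuming the psycopg2 driver is available where the function runs and reusing the pn_delivery_report table from the question; the host, database, user and password values are placeholders.

import psycopg2
from psycopg2.extras import execute_values

def upsert_pn_report_to_postgres(df, epoch_id):
    # Pull the micro-batch to the driver; acceptable for small windowed
    # aggregates, otherwise switch to foreachPartition and open one
    # connection per partition instead.
    rows = [tuple(r) for r in df.select('source_id', 'start_ts', 'end_ts', 'count').collect()]
    if not rows:
        return
    conn = psycopg2.connect(host='db_endpoint', dbname='db',
                            user='postgres', password='PASSWORD')
    try:
        with conn, conn.cursor() as cur:
            # execute_values expands the single %s placeholder into the batch of rows
            execute_values(
                cur,
                """INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
                   VALUES %s
                   ON CONFLICT (source_id, start_ts) DO UPDATE
                   SET count = excluded.count""",
                rows)
    finally:
        conn.close()

windowed_report_df.writeStream \
    .foreachBatch(upsert_pn_report_to_postgres) \
    .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
    .outputMode('update') \
    .start()

With update output mode, each trigger re-emits only the windows whose counts changed, and the ON CONFLICT clause overwrites the previously written count for that (source_id, start_ts) pair instead of failing on the unique constraint.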