
Upsert data in postgresql using spark structured streaming

I am trying to run a structured streaming application using (py)spark. My data is read from a Kafka topic, and then I run a windowed aggregation on event time.

# I have been able to create data frame pn_data_df after reading data from Kafka

Schema of pn_data_df
 |-- id: StringType
 |-- source: StringType
 |-- source_id: StringType
 |-- delivered_time: TimestampType
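
For context, a minimal sketch of how a frame with this schema could be produced from Kafka follows; the topic name, the bootstrap servers, and the assumption that the messages are JSON-encoded are all hypothetical, since the question omits this step.

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("pn_delivery_report").getOrCreate()

# Schema matching the fields listed above
pn_schema = StructType([
    StructField("id", StringType()),
    StructField("source", StringType()),
    StructField("source_id", StringType()),
    StructField("delivered_time", TimestampType()),
])

# Hypothetical topic and broker address; adjust to the real environment
pn_data_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "pn_events") \
    .load() \
    .select(from_json(col("value").cast("string"), pn_schema).alias("pn")) \
    .select("pn.*")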

from pyspark.sql.functions import window, unix_timestamp

# 15-minute event-time windows per source_id, with a 24-hour watermark on delivered_time
windowed_report_df = pn_data_df.filter(pn_data_df.source == 'campaign') \
    .withWatermark("delivered_time", "24 hours") \
    .groupBy('source_id', window('delivered_time', '15 minute')) \
    .count()

# flatten the window struct into epoch-second boundaries
windowed_report_df = windowed_report_df \
    .withColumn('start_ts', unix_timestamp(windowed_report_df.window.start)) \
    .withColumn('end_ts', unix_timestamp(windowed_report_df.window.end)) \
    .selectExpr('CAST(source_id as LONG)', 'start_ts', 'end_ts', 'count')

I am writing this windowed aggregation to my postgresql database, into a table I have already created:

CREATE TABLE pn_delivery_report(
   source_id bigint not null,
   start_ts bigint not null,
   end_ts bigint not null,
   count integer not null,
   unique(source_id, start_ts)
);

Writing to postgresql using the spark jdbc sink only lets me choose Append or Overwrite mode. Append mode fails whenever a row with the same composite key already exists in the table, and Overwrite simply replaces the entire table with the current batch output.

def write_pn_report_to_postgres(df, epoch_id):
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()

windowed_report_df.writeStream \
   .foreachBatch(write_pn_report_to_postgres) \
   .option("checkpointLocation", '/home/hadoop/campaign_report_df_windowed_checkpoint') \
   .outputMode('update') \
   .start()

How can I execute a query like

INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
VALUES (1001, 125000000001, 125000050000, 128),
       (1002, 125000000001, 125000050000, 127)
ON CONFLICT (source_id, start_ts)
DO UPDATE SET count = excluded.count;

in foreachBatch?

Spark has a jira feature ticket open for this, but it seems it has not been prioritised so far:

https://issues.apache.org/jira/browse/SPARK-19335
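
Spark's JDBC sink itself does not expose an upsert / ON CONFLICT mode (that is what SPARK-19335 tracks), but inside foreachBatch the micro-batch can be written with any client library. Below is a minimal sketch of that idea, not taken from the question or the answer below: it assumes psycopg2 is installed on the driver, that each windowed aggregate is small enough to collect, and it uses placeholder connection parameters.

import psycopg2
from psycopg2.extras import execute_values

# The ON CONFLICT statement from the question, with a placeholder for the rows
UPSERT_SQL = """
    INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
    VALUES %s
    ON CONFLICT (source_id, start_ts) DO UPDATE SET count = excluded.count
"""

def upsert_pn_report_to_postgres(df, epoch_id):
    # Collect the (small) windowed aggregate to the driver as plain tuples
    rows = [(r['source_id'], r['start_ts'], r['end_ts'], r['count']) for r in df.collect()]
    if not rows:
        return
    # Placeholder connection details
    conn = psycopg2.connect(host='db_endpoint', dbname='db',
                            user='postgres', password='PASSWORD')
    try:
        with conn, conn.cursor() as cur:
            # execute_values expands VALUES %s into one multi-row insert
            execute_values(cur, UPSERT_SQL, rows)
    finally:
        conn.close()

The writeStream wiring stays the same as above; only the function passed to foreachBatch changes.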

This worked for me:

def _write_streaming(self, df, epoch_id) -> None:
    df.write \
        .mode('append') \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost:5432/postgres") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", 'table_test') \
        .option("user", 'user') \
        .option("password", 'password') \
        .save()

df_stream.writeStream \
    .foreachBatch(_write_streaming) \
    .start() \
    .awaitTermination()

You need to add ".awaitTermination()" at the end.
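
Note that the snippet above still writes in append mode, so by itself it does not solve the ON CONFLICT requirement from the question. One alternative to collecting each batch to the driver, sketched here under the assumption that a staging table pn_delivery_report_staging with the same columns as the target exists, is to append the micro-batch to the staging table through the JDBC sink and then merge it with a single statement:

import psycopg2

# Merge staging rows into the target, then clear the staging table (runs in one transaction)
MERGE_SQL = """
    INSERT INTO pn_delivery_report (source_id, start_ts, end_ts, count)
    SELECT source_id, start_ts, end_ts, count FROM pn_delivery_report_staging
    ON CONFLICT (source_id, start_ts) DO UPDATE SET count = excluded.count;
    TRUNCATE pn_delivery_report_staging;
"""

def upsert_via_staging(df, epoch_id):
    # 1. Append the micro-batch to the hypothetical staging table via the JDBC sink
    df.write \
        .mode('append') \
        .format('jdbc') \
        .option("url", "jdbc:postgresql://db_endpoint/db") \
        .option("driver", "org.postgresql.Driver") \
        .option("dbtable", "pn_delivery_report_staging") \
        .option("user", "postgres") \
        .option("password", "PASSWORD") \
        .save()
    # 2. Merge into the target table with plain psycopg2 (placeholder credentials)
    conn = psycopg2.connect(host='db_endpoint', dbname='db',
                            user='postgres', password='PASSWORD')
    try:
        with conn, conn.cursor() as cur:
            cur.execute(MERGE_SQL)
    finally:
        conn.close()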
