How to use structured spark streaming in python to insert row into Mongodb using ForeachWriter?

I'm using Spark Streaming to read data from Kafka and insert it into MongoDB. I'm using pyspark 2.4.4. I'm trying to make use of ForeachWriter, because just using the foreach method means the connection would be established for every row.

from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)
        pass

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        # self.coll = None  -> used this to test whether I'd get an exception here, but I don't get one
        self.coll.insert_one(row.asDict())
        pass

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
        pass

df_w=df7\
        .writeStream\
        .foreach(ForeachWriter())\
        .trigger(processingTime='1 seconds') \
        .outputMode("update") \
        .option("truncate", "false")\
        .start()

df_w=df7\
        .writeStream\
        .foreach(ForeachWriter())\
        .trigger(processingTime='1 seconds') \
        .outputMode("update") \
        .option("truncate", "false")\
        .start()

My problem is that it's not inserting into MongoDB and I can't find a solution for this. If I comment it out I'll get an error, but the process method is not executing. Anyone have any ideas?

You set the collection to None in the first line of the process function, so you insert the row into nowhere. Also, I don't know if it's just here or in your code as well, but you have the writeStream part two times.
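For reference, a minimal sketch of the sink with the writeStream chain written only once (df7 and ForeachWriter are the names from the question; awaitTermination simply blocks until the query stops):

df_w = df7.writeStream \
    .foreach(ForeachWriter()) \
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .start()

# Block until the streaming query terminates.
df_w.awaitTermination()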

This is probably not documented in the Spark docs, but if you look at the definition of foreach in pyspark, it has the following lines of code:

# Check if the data should be processed
should_process = True
if open_exists:
    should_process = f.open(partition_id, epoch_id)

Therefore, whenever we open a new connection, open must return True. The example in the actual documentation uses 'pass', which results in 'process()' never getting called. (This answer is for future reference for anybody facing the same issue.)
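Applying that fix to the writer from the question, a minimal sketch (the MongoDB host, database, and collection names are the placeholders from the question, and MongoClient comes from pymongo):

from pymongo import MongoClient

class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Returning True tells Spark to go ahead and call process() for this partition.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        return True

    def process(self, row):
        # Insert each row as a document.
        self.coll.insert_one(row.asDict())

    def close(self, error):
        # error is None if the partition was processed successfully.
        self.connection.close()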
