How to use structured Spark streaming in Python to insert rows into MongoDB using ForeachWriter?
I'm using Spark Structured Streaming to read data from Kafka and insert it into MongoDB. I'm using PySpark 2.4.4. I'm trying to use a ForeachWriter, because using the plain foreach approach would establish a new connection for every row.
class ForeachWriter:
    def open(self, partition_id, epoch_id):
        # Open connection. This method is optional in Python.
        self.connection = MongoClient("192.168.0.100:27017")
        self.db = self.connection['test']
        self.coll = self.db['output']
        print(epoch_id)

    def process(self, row):
        # Write row to connection. This method is NOT optional in Python.
        # self.coll = None -> used this to test whether I get an exception if it is there, but I'm not getting one
        self.coll.insert_one(row.asDict())

    def close(self, error):
        # Close the connection. This method is optional in Python.
        print(error)
df_w = df7\
    .writeStream\
    .foreach(ForeachWriter())\
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false")\
    .start()

df_w = df7\
    .writeStream\
    .foreach(ForeachWriter())\
    .trigger(processingTime='1 seconds') \
    .outputMode("update") \
    .option("truncate", "false")\
    .start()
My problem is that nothing is inserted into MongoDB, and I can't find a solution for this. If I comment the insert out I get an error, but the process method is not executing. Does anyone have any ideas?
You set the collection to None in the first line of the process function, therefore you insert the row into nowhere. Also, I don't know whether it is only here or in your actual code as well, but you have the writeStream part two times.
This is probably not documented in the Spark docs, but if you look at the definition of foreach in PySpark, it contains the following code:
# Check if the data should be processed
should_process = True
if open_exists:
    should_process = f.open(partition_id, epoch_id)
Therefore, whenever we open a new connection, open must return True. The example in the actual documentation ends open with 'pass' (which returns None), and that results in process() never getting called. (This answer is for future reference for anybody facing the same issue.)
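To illustrate the effect, here is a minimal self-contained sketch (plain Python, no Spark or MongoDB required; all names here are mine) of the dispatch logic quoted above, comparing a writer whose open ends with pass against one that returns True:

```python
class BrokenWriter:
    def open(self, partition_id, epoch_id):
        pass  # implicitly returns None, which is falsy


class FixedWriter:
    def open(self, partition_id, epoch_id):
        # A real writer would also create the MongoClient here.
        return True  # tells Spark it should call process()


def run_partition(writer, rows):
    # Simplified stand-in for PySpark's foreach dispatch shown above.
    processed = []
    should_process = True
    if hasattr(writer, "open"):
        should_process = writer.open(0, 0)
    if should_process:
        for row in rows:
            processed.append(row)  # stands in for writer.process(row)
    return processed


rows = [{"value": 1}, {"value": 2}]
print(run_partition(BrokenWriter(), rows))  # []
print(run_partition(FixedWriter(), rows))   # [{'value': 1}, {'value': 2}]
```

So in the writer from the question, adding `return True` at the end of open (and removing the duplicated writeStream block) should make process() run.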