How to sink streaming data from spark to Mongodb?

I am using pyspark to read streaming data from Kafka and then I want to sink that data to mongodb. I have included all the required packages, but it throws the error:
UnsupportedOperationException: Data source com.mongodb.spark.sql.DefaultSource does not support streamed writing

The following links are not related to my question:

Writing to mongoDB from Spark

Spark to MongoDB via Mesos

Here is the full error stack trace:

Traceback (most recent call last):
  File "/home/b3ds/kafka-spark.py", line 85, in <module>
    .option("com.mongodb.spark.sql.DefaultSource","mongodb://localhost:27017/twitter.test")\
  File "/home/b3ds/hdp/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 827, in start
  File "/home/b3ds/hdp/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/b3ds/hdp/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/home/b3ds/hdp/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o122.start.
: java.lang.UnsupportedOperationException: Data source com.mongodb.spark.sql.DefaultSource does not support streamed writing
    at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:287)
    at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:272)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Here is my pyspark code:

from __future__ import print_function
import sys
from pyspark.sql.functions import udf
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.types import StructType
from pyspark.sql.types import *
import json
from pyspark.sql.functions import struct
from pyspark.sql.functions import *
import datetime

json_schema = StructType([
  StructField("twitterid", StringType(), True),
  StructField("created_at", StringType(), True),
  StructField("tweet", StringType(), True),
  StructField("screen_name", StringType(), True)
])

def parse_json(df):
    # Extract the fields of interest from the raw JSON tweet (the Kafka value)
    twitterid   = json.loads(df[0])['id']
    created_at  = json.loads(df[0])['created_at']
    tweet       = json.loads(df[0])['text']
    screen_name = json.loads(df[0])['user']['screen_name']
    return [twitterid, created_at, tweet, screen_name]

def convert_twitter_date(timestamp_str):
    output_ts = datetime.datetime.strptime(timestamp_str.replace('+0000 ',''), '%a %b %d %H:%M:%S %Y')
    return output_ts

if __name__ == "__main__":

        spark = SparkSession\
                        .builder\
                        .appName("StructuredNetworkWordCount")\
                        .config("spark.mongodb.input.uri","mongodb://192.168.1.16:27017/twitter.test")\
                        .config("spark.mongodb.output.uri","mongodb://192.168.1.16:27017/twitter.test")\
                        .getOrCreate()
        events = spark\
                        .readStream\
                        .format("kafka")\
                        .option("kafka.bootstrap.servers", "localhost:9092")\
                        .option("subscribe", "twitter")\
                        .load()
        events = events.selectExpr("CAST(value as String)")

        udf_parse_json = udf(parse_json, json_schema)
        udf_convert_twitter_date = udf(convert_twitter_date, TimestampType())
        jsonoutput = events.withColumn("parsed_field", udf_parse_json(struct([events[x] for x in events.columns]))) \
                                        .where(col("parsed_field").isNotNull()) \
                                        .withColumn("created_at", col("parsed_field.created_at")) \
                                        .withColumn("screen_name", col("parsed_field.screen_name")) \
                                        .withColumn("tweet", col("parsed_field.tweet")) \
                                        .withColumn("created_at_ts", udf_convert_twitter_date(col("parsed_field.created_at")))

        windowedCounts = jsonoutput.groupBy(window(jsonoutput.created_at_ts, "1 minutes", "15 seconds"), jsonoutput.screen_name)

        mongooutput = jsonoutput \
                        .writeStream \
                        .format("com.mongodb.spark.sql.DefaultSource")\
                        .option("com.mongodb.spark.sql.DefaultSource","mongodb://localhost:27017/twitter.test")\
                        .start()
        mongooutput.awaitTermination()

I have seen the MongoDB documentation, which says it supports a Spark-to-Mongo sink:

https://docs.mongodb.com/spark-connector/master/scala/streaming/

Update: the MongoDB Spark Connector now supports Apache Spark Structured Streaming.

https://www.mongodb.com/docs/spark-connector/current/structured-streaming/
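With that newer (10.x) connector, a streaming write would look roughly like the sketch below. This is only an illustration based on the page above: the "mongodb" format and the connection/database/collection option names come from the current docs, while the URI, database/collection names and checkpoint path are placeholders.

# Sketch only: assumes MongoDB Spark Connector 10.x, which registers a "mongodb"
# format with streaming-write support. URI, database, collection and checkpoint
# path below are placeholders.
query = jsonoutput.writeStream \
    .format("mongodb") \
    .option("spark.mongodb.connection.uri", "mongodb://localhost:27017") \
    .option("spark.mongodb.database", "twitter") \
    .option("spark.mongodb.collection", "test") \
    .option("checkpointLocation", "/tmp/mongo-checkpoint") \
    .outputMode("append") \
    .start()

query.awaitTermination()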

"I have seen the mongodb documentation which says it supports spark to mongo sink"

What the documentation claims is that you can use the standard RDD API to write each RDD using the legacy Streaming (DStream) API.
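In other words, the pattern the linked page describes is the DStream one: read from Kafka with the legacy streaming API and write each micro-batch RDD with the batch Mongo connector. A minimal sketch of that approach, reusing parse_json, json_schema and the spark.mongodb.output.uri from the question (broker and topic names are the ones used above):

# Sketch only: legacy DStream approach, writing each RDD with the batch connector.
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(spark.sparkContext, 10)  # 10-second micro-batches
kafka_stream = KafkaUtils.createDirectStream(
    ssc, ["twitter"], {"metadata.broker.list": "localhost:9092"})

def save_rdd(rdd):
    # Called on the driver for every micro-batch; the write itself is an ordinary
    # batch write that relies on spark.mongodb.output.uri set on the SparkSession.
    if not rdd.isEmpty():
        rows = rdd.map(lambda kv: parse_json([kv[1]]))
        spark.createDataFrame(rows, json_schema) \
             .write \
             .format("com.mongodb.spark.sql.DefaultSource") \
             .mode("append") \
             .save()

kafka_stream.foreachRDD(save_rdd)
ssc.start()
ssc.awaitTermination()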

It doesn't suggest that MongoDB supports Structured Streaming, and it doesn't. Since you use PySpark, where the foreach writer is not accessible, you'll have to wait until (if ever) the MongoDB package is updated to support streaming operations.
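If upgrading the connector is not an option but you can move to Spark 2.4 or later, one possible workaround is foreachBatch, which hands each micro-batch to ordinary batch code. A minimal sketch, assuming Spark 2.4+ and again relying on the spark.mongodb.output.uri configured on the SparkSession (the checkpoint path is a placeholder):

# Sketch only: requires Spark 2.4+, where PySpark exposes foreachBatch.
# Each micro-batch DataFrame is written with the batch MongoDB connector.
def write_to_mongo(batch_df, batch_id):
    batch_df.write \
        .format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .save()

mongooutput = jsonoutput \
    .writeStream \
    .foreachBatch(write_to_mongo) \
    .option("checkpointLocation", "/tmp/kafka-mongo-checkpoint") \
    .start()

mongooutput.awaitTermination()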
