
Load a Spark Streaming Data Frame into MongoDB

I am working on a project with the following data pipeline:

Twitter → Tweepy API (Stream) → Kafka → Spark (Real-Time Sentiment Analysis) → MongoDB → Tableau

I was able to stream tweets using Tweepy into a Kafka producer and from the producer into a Kafka consumer. I then used the Twitter stream from the Kafka consumer as the data source, created a streaming DataFrame in Spark (PySpark), and performed real-time pre-processing and sentiment analysis. The resulting streaming DataFrame needs to go into MongoDB, and this is where the problem lies.

I am able to write a "static" PySpark DataFrame into MongoDB, but not the streaming DataFrame.
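For reference, a minimal sketch of the kind of static write that does work for me (static_df and its columns are hypothetical here; the MongoDB write connection is configured as in the details below):

# Hypothetical static DataFrame built from an in-memory list.
static_df = spark.createDataFrame([("hello world", 0.9)], ["text", "sentiment"])
# Batch write into the configured MongoDB database/collection.
static_df.write.format("mongodb").mode("append").save()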

Details are below:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

mongo_conn = "mongodb+srv://<username>:<password>@cluster0.afic7p0.mongodb.net/?retryWrites=true&w=majority"
conf = SparkConf()
# Download mongo-spark-connector and its dependencies.
conf.set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector:10.0.5")
conf.set("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1")
# Set up the read connection:
conf.set("spark.mongodb.read.connection.uri", mongo_conn)
conf.set("spark.mongodb.read.database", "mySecondDataBase")
conf.set("spark.mongodb.read.collection", "TwitterStreamv2")
# Set up the write connection:
conf.set("spark.mongodb.write.connection.uri", mongo_conn)
conf.set("spark.mongodb.write.database", "mySecondDataBase")
conf.set("spark.mongodb.write.collection", "TwitterStreamv2")
SparkContext.getOrCreate(conf=conf)
spark = SparkSession.builder.appName("myApp").getOrCreate()

Reading the Kafka Data Frame (streaming):

df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("startingOffsets", "earliest") \
        .option("kafka.group.id", "group1") \
        .option("subscribe", "twitter") \
        .load()

Skipping the pre-processing & sentiment analysis code.

Writing the data stream to MongoDB:

def write_row(batch_df , batch_id):
    batch_df.write.format("mongodb").mode("append").save()
    pass

sentiment_tweets.writeStream.foreachBatch(write_row).start().awaitTermination()

Where "sentiment_tweets" is the resultant Streaming Data Frame.其中“sentiment_tweets”是生成的流数据帧。 The code above doesn't work.上面的代码不起作用。

ERROR:py4j.clientserver:There was an exception while executing the Python Proxy on the Python Side.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/py4j/clientserver.py", line 617, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 272, in call
    raise e
  File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 269, in call
    self.func(DataFrame(jdf, self.session), batch_id)
  File "<ipython-input-34-a3fa83af6c03>", line 2, in write_row
    batch_df.write.format("mongodb").mode("append").save()
  File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/readwriter.py", line 966, in save
    self._jwrite.save()
  File "/usr/local/lib/python3.7/dist-packages/py4j/java_gateway.py", line 1322, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/dist-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o159.save.
: java.lang.ClassNotFoundException: 
Failed to find data source: mongodb. Please find packages at
https://spark.apache.org/third-party-projects.html
       
    at org.apache.spark.sql.errors.QueryExecutionErrors$.failedToFindDataSourceError(QueryExecutionErrors.scala:587)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:675)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:725)
    at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:864)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:247)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)

Note: I came here after going through Unable to send data to MongoDB using Kafka-Spark Structured Streaming.

Regarding the error "Failed to find data source: mongodb":

spark.jars.packages takes a comma-separated list. If you set it twice (to add the Kafka libraries), the second call overrides the first, as the sketch below shows.
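A minimal sketch of the fix, joining both packages (coordinates copied from the question) into a single comma-separated value:

# One call, one comma-separated list; nothing gets overwritten.
conf.set(
    "spark.jars.packages",
    "org.mongodb.spark:mongo-spark-connector:10.0.5,"
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
)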

You should also build the configuration through SparkSession, not SparkConf/SparkContext.
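Putting both points together, a sketch of the corrected setup (URI, database, and collection names reused from the question; the credentials placeholder still has to be filled in):

from pyspark.sql import SparkSession

mongo_conn = "mongodb+srv://<username>:<password>@cluster0.afic7p0.mongodb.net/?retryWrites=true&w=majority"

spark = (
    SparkSession.builder
    .appName("myApp")
    # Single comma-separated spark.jars.packages value for both connectors.
    .config(
        "spark.jars.packages",
        "org.mongodb.spark:mongo-spark-connector:10.0.5,"
        "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1",
    )
    # Read connection.
    .config("spark.mongodb.read.connection.uri", mongo_conn)
    .config("spark.mongodb.read.database", "mySecondDataBase")
    .config("spark.mongodb.read.collection", "TwitterStreamv2")
    # Write connection.
    .config("spark.mongodb.write.connection.uri", mongo_conn)
    .config("spark.mongodb.write.database", "mySecondDataBase")
    .config("spark.mongodb.write.collection", "TwitterStreamv2")
    .getOrCreate()
)

With the session built this way, the same foreachBatch write from the question should be able to resolve the mongodb data source.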
