
Spark with MongoDB is very slow

I am running the spark-shell with the MongoDB connector, but the program is very slow and I don't think I will ever get a response from it.

My spark-shell command is:

./spark-shell --master spark://spark_host:7077 \
--conf "spark.mongodb.input.uri=mongodb://mongod_user:password@mongod_host:27017/database.collection?readPreference=primaryPreferred" \
--jars /mongodb/lib/mongo-spark-connector_2.10-2.0.0.jar,/mongodb/lib/bson-3.2.2.jar,/mongodb/lib/mongo-java-driver-3.2.2.jar

And my app code is:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import com.mongodb.spark._
import org.bson.Document
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession
import com.mongodb.spark.rdd.MongoRDD

val sparkSession = SparkSession.builder().getOrCreate()
// Load the collection configured by spark.mongodb.input.uri as a DataFrame
val df = MongoSpark.load(sparkSession)
val dataset = df.filter("thisRequestTime > 1499250131596")
dataset.first // hangs for a very long time

What have I missed? Please help me. PS: my Spark runs in standalone mode. The app dependencies are:

    <properties>
        <encoding>UTF-8</encoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.compat.version>2.11</scala.compat.version>
        <spark.version>2.1.1</spark.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_${scala.compat.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_${scala.compat.version}</artifactId>
            <version>${spark.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.mongodb.spark</groupId>
            <artifactId>mongo-spark-connector_${scala.compat.version}</artifactId>
            <version>2.0.0</version>
        </dependency>
    </dependencies> 

I was stuck in this kind of problem for a while, but got it over at last. I don't know the details of your MongoDB configuration, but here is the solution to my problem; I hope you find it helpful.

My dataset is huge, too. I had configured a sharded cluster for MongoDB, and that is what made it slow. To solve it, add one more conf: spark.mongodb.input.partitioner=MongoShardedPartitioner. Otherwise the default partitioning policy is used, which is not suitable for a sharded MongoDB.
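For example, the partitioner can be passed the same way as the input URI (a minimal sketch; the host names, credentials, and jar paths are the placeholders from the question):

./spark-shell --master spark://spark_host:7077 \
--conf "spark.mongodb.input.uri=mongodb://mongod_user:password@mongod_host:27017/database.collection?readPreference=primaryPreferred" \
--conf "spark.mongodb.input.partitioner=MongoShardedPartitioner" \
--jars /mongodb/lib/mongo-spark-connector_2.10-2.0.0.jar,/mongodb/lib/bson-3.2.2.jar,/mongodb/lib/mongo-java-driver-3.2.2.jar

It can also be set per read through ReadConfig, which the question already imports. Again a sketch: "partitioner" is the connector's short form of spark.mongodb.input.partitioner, and MongoShardedPartitioner assumes the collection's shard key is _id unless you also set its shardkey partitioner option.

import com.mongodb.spark._
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder().getOrCreate()
// Override only the partitioner, inheriting the rest of the session's mongo settings
val readConfig = ReadConfig(Map("partitioner" -> "MongoShardedPartitioner"), Some(ReadConfig(sparkSession)))
val df = MongoSpark.load(sparkSession, readConfig)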

You can find more specific information here.

Good luck!
