Spark rdd.count() yields inconsistent results

Question

I'm a bit baffled.

A simple rdd.count() gives different results when run multiple times.

Here is the code I run:

val inputRdd = sc.newAPIHadoopRDD(
  inputConfig,
  classOf[com.mongodb.hadoop.MongoInputFormat],
  classOf[Long],
  classOf[org.bson.BSONObject])

println(inputRdd.count())

It opens a connection to a MongoDB server and simply counts the objects. That seems pretty straightforward to me.

According to MongoDB there are 3,349,495 entries.

Here is my Spark output; every run used the same jar:

spark1 :    3.257.048  
spark2 :    3.303.272  
spark3 :    3.303.272  
spark4 :    3.303.272  
spark5 :    3.303.271   
spark6 :    3.303.271  
spark7 :    3.303.272  
spark8 :    3.303.272  
spark9 :    3.306.300  
spark10:    3.303.272  
spark11:    3.303.271
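To put numbers on the discrepancy, here is a minimal Scala sketch that compares the eleven counts above with the figure MongoDB reports. The object name `CountSpread` and its field names are made up for illustration; the values are copied from the listing above, and the code needs neither Spark nor MongoDB.

```scala
object CountSpread {
  // Document count reported by MongoDB (from the question).
  val expected: Long = 3349495L

  // Counts printed by the runs spark1 .. spark11, in order.
  val runs: Seq[Long] = Seq(
    3257048L, 3303272L, 3303272L, 3303272L, 3303271L, 3303271L,
    3303272L, 3303272L, 3306300L, 3303272L, 3303271L)

  // How many documents each run is short of MongoDB's figure.
  val missing: Seq[Long] = runs.map(expected - _)

  def main(args: Array[String]): Unit = {
    println(s"distinct results: ${runs.distinct.sorted.mkString(", ")}")
    println(s"smallest shortfall: ${missing.min}")
    println(s"largest shortfall: ${missing.max}")
    println(s"every run undercounts: ${missing.forall(_ > 0)}")
  }
}
```

There are four distinct results, and every run comes in below MongoDB's figure, by between 43,195 and 92,447 documents.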

Spark and MongoDB run on the same cluster.
We are running:

Spark version 1.5.0-cdh5.6.1  
Scala version 2.10.4  
MongoDB version 2.6.12

Unfortunately, we cannot upgrade any of these.

Is Spark non-deterministic?
Is there anyone who can enlighten me?

Thanks in advance

EDIT / Further info
I just noticed an error in our mongod.log. Could this error cause the inconsistent behaviour?

[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet syncing to: hadoop05:27017
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017
[rsBackgroundSync] replSet our last optime : Jul  2 10:19:44 57777920:111
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul  5 15:17:58 577bb386:59
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[rsBackgroundSync] replSet error RS102 too stale to catch up

Answer 1

As you already spotted, the problem does not appear to be with Spark (or Scala) but with MongoDB.

As such, the question of why the counts differ seems to be resolved.

You will still want to troubleshoot the actual MongoDB error (RS102, a replica set member too stale to catch up). The link given in the log is a good starting point: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember

Answer 2

MongoDB's count returns an estimated count. As such, the value returned can change even if the number of documents hasn't changed.

countDocuments was added in MongoDB 4.0 to provide an accurate count (it also works in multi-document transactions).
