I'm a bit baffled.
A simple rdd.count() gives different results when run multiple times.
Here is the code i run:
val inputRdd = sc.newAPIHadoopRDD(inputConfig,
classOf[com.mongodb.hadoop.MongoInputFormat],
classOf[Long],
classOf[org.bson.BSONObject])
println(inputRdd.count())
It opens a connection to a MondoDb Server and simply counts the Objects. Seems pretty straight forward to me
According to MongoDb there are 3,349,495 entries
Here is my spark output, all ran the same jar:
spark1 : 3.257.048
spark2 : 3.303.272
spark3 : 3.303.272
spark4 : 3.303.272
spark5 : 3.303.271
spark6 : 3.303.271
spark7 : 3.303.272
spark8 : 3.303.272
spark9 : 3.306.300
spark10: 3.303.272
spark11: 3.303.271
Spark and MongoDb are run on the same cluster.
We are running:
Spark version 1.5.0-cdh5.6.1
Scala version 2.10.4
MongoDb version 2.6.12
Unfortunately we can not update these
Is Spark non-deterministic?
Is there anyone who can enlighten me?
Thanks in advance
EDIT/ Further Info
I just noticed an error in our mongod.log. Could this error cause the inconsistent behaviour?
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet syncing to: hadoop05:27017
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop05:27017, it is vetoed for 600 more seconds
[rsBackgroundSync] replSet not trying to sync from hadoop04:27017, it is vetoed for 333 more seconds
[rsBackgroundSync] replSet error RS102 too stale to catch up, at least from hadoop05:27017
[rsBackgroundSync] replSet our last optime : Jul 2 10:19:44 57777920:111
[rsBackgroundSync] replSet oldest at hadoop05:27017 : Jul 5 15:17:58 577bb386:59
[rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
[rsBackgroundSync] replSet error RS102 too stale to catch up
As you already spotted, the problem does not appear to be with spark (or scala) but with MongoDB.
As such the question regarding the difference seems to be resolved.
You will still want to troubleshoot the actual MongoDB error, the provided link can be a good starting point for that: http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
count
returns an estimated count. As such, the value returned can change even if the number of documents hasn't changed.
countDocuments was added to MongoDB 4.0 to provide an accurate count (that also works in multi-document transactions).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.