I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.
I'm facing two problems.
Environment:
First I'll set up a dummy example to demonstrate...
import org.bson.Document
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.WriteConfig
import org.apache.spark.rdd.RDD
/**
* fake json data
*/
val recs: List[String] = List(
"""{"a": 123, "b": 456, "c": "apple"}""",
"""{"a": 345, "b": 72, "c": "banana"}""",
"""{"a": 456, "b": 754, "c": "cat"}""",
"""{"a": 876, "b": 43, "c": "donut"}""",
"""{"a": 432, "b": 234, "c": "existential"}"""
)
val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))
Some values that won't change...
// credentials
val user = ???
val pwd = ???
// fixed values
val db = "db_name"
val replset = "replset_name"
val collection_name = "collection_name"
Here's what does NOT work... in this case "url" would look something like machine.unix.domain.org
and "ip" would look like... well, an IP address.
This is how the documentation says to define the host... with every machine in the replica set.
val host = "url1:27017,url2:27017,url3:27017" // hostnames
val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017" // or IP addresses
I can't get either of these to work, using every permutation I can think of for the URI...
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}@${host}" // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}@${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above
EDIT: more detail on the error messages. The errors take two forms...
form 1
typically includes java.net.UnknownHostException: machine.unix.domain.org
also, comes back with server addresses in url form even when defined as IP addresses
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017,
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException:
machine.unix.domain.org}, caused by {java.net.UnknownHostException:
machine.unix.domain.org}}]
form 2
(authentication error... though connecting with same credentials to primary only works fine)
com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting
for a server that matches WritableServerSelector. Client view of cluster
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,
type=UNKNOWN, state=CONNECTING, exception=
{com.mongodb.MongoSecurityException: Exception authenticating
MongoCredential{mechanism=null, userName='xx', source='admin', password=
<hidden>, mechanismProperties={}}}, caused by
{com.mongodb.MongoCommandException: Command failed with error 18:
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...
end EDIT
here's what DOES work... on the dummy data only... more on that below...
val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}@${host}/${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri" -> uri,
"database" -> db,
"collection" -> collection_name,
"replicaSet" -> replset))
// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)
This... connecting to primary only... works great for dummy data, but crashes the primary for real data (50 - 100GB from an RDD with 2700 partitions). My guess is that it opens too many connections at once... it looks like it opens ~900 connections to write (this jibes, since default parallelism is 2700, based on 900 virtual cores and a parallelism factor of 3x).
I'm guessing if I repartition so it opens fewer connections, I'll have better luck... but I'm guessing this also ties in to writing to the primary only instead of spreading it over all instances.
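One way to try the repartitioning idea is to cap the number of partitions before the write, so that fewer tasks write at the same time. The sketch below only computes the capped partition count. The names `WriteThrottleSketch`, `cappedPartitions` and `maxConcurrentWriters` are hypothetical, and whether this actually protects the primary is untested here.

```scala
object WriteThrottleSketch {
  // Hypothetical helper: never ask for more partitions than the RDD already has,
  // and never more than the number of simultaneous writers we are willing to allow.
  def cappedPartitions(currentPartitions: Int, maxConcurrentWriters: Int): Int = {
    require(currentPartitions > 0 && maxConcurrentWriters > 0, "both values must be positive")
    math.min(currentPartitions, maxConcurrentWriters)
  }
}
```

In spark-shell this could be applied as `MongoSpark.save(rdd_hex_bson.coalesce(WriteThrottleSketch.cappedPartitions(rdd_hex_bson.getNumPartitions, 100)), writeConfig)`, where 100 is an arbitrary illustrative cap, not a recommended value.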
I've read everything I can find here... but most examples are for single instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration
It turns out there were two problems here. From the original question, these were referenced as errors of 'form 1' and 'form 2'.
error of 'form 1' - solution
The gist of the problem turned out to be a bug in the mongo-spark-connector: it can't connect to a replica set using IP addresses; it requires hostnames (written as URI1, URI2, URI3 below). Since the DNS servers in our cloud don't have these lookups, I got it working by modifying /etc/hosts
on every executor and then using a connection string like this:
val host = "URI1:27017,URI2:27017,URI3:27017"
val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"
val writeConfig: WriteConfig =
WriteConfig(Map(
"uri"->uri,
"database"->db,
"collection"->collection_name,
"replicaSet"->replset,
"writeConcern.w"->"majority"))
This required first adding the following to /etc/hosts
on every machine:
IP1 URI1
IP2 URI2
IP3 URI3
Now of course, I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts
when the cluster spins up. But that's another question. ( AWS EMR bootstrap action as sudo )
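Since the 'form 1' error is a java.net.UnknownHostException, it may help to check name resolution before attempting a write. The following is a minimal sketch using only the JDK; `ResolveCheckSketch` and `resolvable` are hypothetical names, and it only tests the JVM it runs in (the driver, unless it is called from inside a Spark task).

```scala
import java.net.InetAddress
import scala.util.Try

object ResolveCheckSketch {
  // Pairs each host name with whether this JVM can resolve it to an address.
  // A failed lookup (UnknownHostException) is reported as false instead of thrown.
  def resolvable(hosts: Seq[String]): Seq[(String, Boolean)] =
    hosts.map(h => h -> Try(InetAddress.getByName(h)).isSuccess)
}
```

For example, `ResolveCheckSketch.resolvable(Seq("URI1", "URI2", "URI3"))` should report `true` for every entry once the /etc/hosts lines above are in place on that machine.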
error of 'form 2' - solution
Adding &authSource=${db}
to the URI solved this.
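The 'form 2' error output shows source='admin', i.e. the credentials were being checked against the admin database, which is consistent with authSource fixing it. Below is a small sketch of assembling the working URI format in one place. `MongoUriSketch` and `buildUri` are hypothetical names, and percent-encoding the user name and password is an extra precaution that was not part of the original fix.

```scala
import java.net.URLEncoder

object MongoUriSketch {
  // Percent-encode a credential; URLEncoder emits "+" for spaces, so swap in "%20".
  private def enc(s: String): String =
    URLEncoder.encode(s, "UTF-8").replace("+", "%20")

  // Builds mongodb://user:pwd@host1,host2/db?replicaSet=...&authSource=...
  def buildUri(user: String, pwd: String, hosts: Seq[String],
               db: String, replSet: String, authSource: String): String = {
    val hostList = hosts.mkString(",")
    s"mongodb://${enc(user)}:${enc(pwd)}@$hostList/$db?replicaSet=$replSet&authSource=$authSource"
  }
}
```

With the values from the question, `MongoUriSketch.buildUri(user, pwd, Seq("URI1:27017", "URI2:27017", "URI3:27017"), db, replset, db)` gives the same shape of string as the `uri` shown in the 'form 1' solution.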