
Writing to a Mongo replica set from Spark (in Scala)

I'm trying to write from a Spark RDD to MongoDB using the mongo-spark-connector.

I'm facing two problems:

  • [main problem] I can't connect to Mongo if I define the host according to the documentation (using all instances in the mongo replica set)
  • [secondary/related problem] If I connect to the primary only, I can write... but I typically crash the primary writing the first collection

Environment:

  • mongo-spark-connector 1.1
  • spark 1.6
  • scala 2.10.5

First I'll set up a dummy example to demonstrate...

import org.bson.Document 
import com.mongodb.spark.MongoSpark 
import com.mongodb.spark.config.WriteConfig

import org.apache.spark.rdd.RDD

/** 
  * fake json data
  */

val recs: List[String] = List(
  """{"a": 123, "b": 456, "c": "apple"}""",
  """{"a": 345, "b":  72, "c": "banana"}""",
  """{"a": 456, "b": 754, "c": "cat"}""",
  """{"a": 876, "b":  43, "c": "donut"}""",
  """{"a": 432, "b": 234, "c": "existential"}"""
)

val rdd_json_str: RDD[String] = sc.parallelize(recs, 5)
val rdd_hex_bson: RDD[Document] = rdd_json_str.map(json_str => Document.parse(json_str))

Some values that won't change...

// credentials
val user = ???
val pwd  = ???

// fixed values
val db              = "db_name"
val replset         = "replset_name"
val collection_name = "collection_name"

Here's what does NOT work... in this case "url" would look something like machine.unix.domain.org and "ip" would look like... well, an IP address.

This is how the documentation says to define the host... with every machine in the replica set.

// tried with hostnames and with IP addresses (one definition at a time)
val host = "url1:27017,url2:27017,url3:27017"
// val host = "ip_address1:27017,ip_address2:27017,ip_address3:27017"

I can't get either of these to work. Using every permutation I can think of for the uri...

val uri = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${host}/?replicaSet=${replset}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}"
val uri = s"mongodb://${user}:${pwd}@${replset}/${host}/${db}.${collection_name}"
val uri = s"mongodb://${user}:${pwd}@${host}"       // setting db, collection, replica set in WriteConfig
val uri = s"mongodb://${user}:${pwd}@${host}/${db}" // this works IF HOST IS PRIMARY ONLY; not for hosts as defined above

EDIT: more detail on the error messages. The errors take two forms...

form 1

Typically includes java.net.UnknownHostException: machine.unix.domain.org

Also, the error comes back with server addresses in URL (hostname) form even when the hosts were defined as IP addresses.

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting 
for a server that matches WritableServerSelector. Client view of cluster 
state is {type=REPLICA_SET, servers=[{address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}, {address=machine.unix.domain.org:27017, 
type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: 
machine.unix.domain.org}, caused by {java.net.UnknownHostException: 
machine.unix.domain.org}}]

form 2

(authentication error... though connecting with same credentials to primary only works fine)

com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting 
for a server that matches WritableServerSelector. Client view of cluster  
state is {type=REPLICA_SET, servers=[{address=xx.xx.xx.xx:27017,  
type=UNKNOWN, state=CONNECTING, exception= 
{com.mongodb.MongoSecurityException: Exception authenticating  
MongoCredential{mechanism=null, userName='xx', source='admin', password= 
<hidden>, mechanismProperties={}}}, caused by  
{com.mongodb.MongoCommandException: Command failed with error 18:  
'Authentication failed.' on server xx.xx.xx.xx:27017. The full response is {  
"ok" : 0.0, "errmsg" : "Authentication failed.", "code" : 18, "codeName" :  
"AuthenticationFailed", "operationTime" : { "$timestamp" : { "t" :  
1534459121, "i" : 1 } }, "$clusterTime" : { "clusterTime" : { "$timestamp" :  
{ "t" : 1534459121, "i" : 1 } }, "signature" : { "hash" : { "$binary" :  
"xxx=", "$type" : "0" }, "keyId" : { "$numberLong" : "123456" } } } }}}...

end EDIT

Here's what DOES work... on the dummy data only... more on that below...

val host = s"${primary_ip_address}:27017" // primary only
val uri = s"mongodb://${user}:${pwd}@${host}/${db}"

val writeConfig: WriteConfig = 
  WriteConfig(Map(
    "uri"        -> uri, 
    "database"   -> db, 
    "collection" -> collection_name, 
    "replicaSet" -> replset))

// write data to mongo
MongoSpark.save(rdd_hex_bson, writeConfig)

This... connecting to primary only... works great for dummy data, but crashes the primary for real data (50 - 100GB from an RDD with 2700 partitions). My guess is that it opens up too many connections at once... it looks like it opens ~900 connections to write (this jibes, since the default parallelism of 2700 is based on 900 virtual cores and a parallelism factor of 3x, so up to 900 tasks run at once).

I'm guessing if I repartition so it opens fewer connections, I'll have better luck... but I'm guessing this also ties in to writing to the primary only instead of spreading it over all instances.
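As a rough sketch of the repartitioning idea above: assuming each running task opens its own connection (the question's guess, not something confirmed here), the number of simultaneous writers is bounded by the smaller of the partition count and the core count. The object and function names below are hypothetical, and the cap of 50 writers is an arbitrary illustrative value.

```scala
object WriterConcurrency {
  // Tasks running at once are bounded by both the partition count and the available cores.
  def concurrentTasks(partitions: Int, cores: Int): Int =
    math.min(partitions, cores)

  // Hypothetical helper: partition count to coalesce to so that
  // at most maxWriters tasks (and, by assumption, connections) are active at once.
  def targetPartitions(currentPartitions: Int, maxWriters: Int): Int = {
    require(currentPartitions > 0 && maxWriters > 0, "counts must be positive")
    math.min(currentPartitions, maxWriters)
  }

  def main(args: Array[String]): Unit = {
    // Numbers from the question: 2700 partitions on 900 virtual cores.
    val before = concurrentTasks(partitions = 2700, cores = 900)
    val target = targetPartitions(currentPartitions = 2700, maxWriters = 50)
    val after  = concurrentTasks(partitions = target, cores = 900)
    println(s"before: $before, target partitions: $target, after: $after")
  }
}
```

In the shell this would translate to something like MongoSpark.save(rdd_hex_bson.coalesce(target), writeConfig). The trade-off is that fewer partitions means each task writes a much larger slice of the data, so the cap probably needs tuning rather than being set as low as possible.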

I've read everything I can find here... but most examples are for single instance connections... https://docs.mongodb.com/spark-connector/v1.1/configuration/#output-configuration

It turns out there were two problems here. From the original question, these were referenced as errors of 'form 1' and 'form 2'.

error of 'form 1' - solution

The gist of the problem turned out to be a bug in the mongo-spark-connector: it can't connect to a replica set using IP addresses; it requires hostnames (shown as URI1, URI2, URI3 below). Since the DNS servers in our cloud can't resolve these names, I got it working by modifying /etc/hosts on every executor and then using a connection string like this:

val host = "URI1:27017,URI2:27017,URI3:27017"

val uri  = s"mongodb://${user}:${pwd}@${host}/${db}?replicaSet=${replset}&authSource=${db}"

val writeConfig: WriteConfig = 
  WriteConfig(Map(
    "uri"            -> uri, 
    "database"       -> db, 
    "collection"     -> collection_name, // name defined in the question's setup
    "replicaSet"     -> replset, 
    "writeConcern.w" -> "majority"))

This required first adding the following to /etc/hosts on every machine:

IP1 URI1
IP2 URI2
IP3 URI3
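One way to check whether entries like these have taken effect is to try resolving each replica set member from the JVM, since a name that fails here would presumably surface as the java.net.UnknownHostException seen in 'form 1'. This is only a minimal sketch: the object and function names are hypothetical, and it assumes plain host:port entries (no IPv6 literals).

```scala
import java.net.InetAddress
import scala.util.Try

object HostCheck {
  // Strip the ":port" suffix from a "host:port" entry (assumes no IPv6 literals).
  def hostOnly(hostPort: String): String = hostPort.split(":").head

  // For each entry, map the host name to Some(resolved address),
  // or None if the lookup fails (for example with UnknownHostException).
  def resolveAll(hostPorts: Seq[String]): Map[String, Option[String]] =
    hostPorts.map { hp =>
      val host = hostOnly(hp)
      host -> Try(InetAddress.getByName(host).getHostAddress).toOption
    }.toMap

  def main(args: Array[String]): Unit = {
    // Placeholder names from the answer; substitute the real replica set members.
    val hosts = Seq("URI1:27017", "URI2:27017", "URI3:27017")
    resolveAll(hosts).foreach {
      case (host, Some(ip)) => println(s"$host -> $ip")
      case (host, None)     => println(s"$host -> NOT RESOLVABLE")
    }
  }
}
```

Running this on the driver only checks the driver. Since the write happens on the executors, the same check could be run inside a map over a small RDD, although that does not guarantee every executor is exercised.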

Now of course, I can't figure out how to use bootstrap actions in AWS EMR to update /etc/hosts when the cluster spins up. But that's another question. ( AWS EMR bootstrap action as sudo )

error of 'form 2' - solution

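Adding &authSource=${db} to the uri solved this. The 'form 2' error shows source='admin' in the credential, i.e. authentication was being attempted against the admin database rather than ${db}.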
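Putting both fixes together, the connection string could be assembled by a small helper like the one sketched here. The object and function names are hypothetical, as are the example credentials. Percent-encoding the user and password is an extra precaution not mentioned in the original answer; it matters only if they contain characters such as @ or :.

```scala
import java.net.URLEncoder

object MongoUriSketch {
  // Percent-encode a credential; URLEncoder emits "+" for spaces, so map that to "%20".
  def encode(s: String): String =
    URLEncoder.encode(s, "UTF-8").replace("+", "%20")

  // Hypothetical helper: builds a replica-set URI with every member listed,
  // plus the replicaSet and authSource options from the answer.
  def buildMongoUri(
      user: String,
      pwd: String,
      hosts: Seq[String],
      db: String,
      replset: String,
      authSource: String): String = {
    val hostList = hosts.mkString(",")
    s"mongodb://${encode(user)}:${encode(pwd)}@$hostList/$db" +
      s"?replicaSet=$replset&authSource=$authSource"
  }

  def main(args: Array[String]): Unit = {
    // Made-up credentials; host names are the placeholders used in the answer.
    val uri = buildMongoUri(
      user = "app_user",
      pwd = "p@ss:word",
      hosts = Seq("URI1:27017", "URI2:27017", "URI3:27017"),
      db = "db_name",
      replset = "replset_name",
      authSource = "db_name")
    println(uri)
  }
}
```

The resulting string can be passed as the "uri" entry of the WriteConfig map in the same way as the hand-built one.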
