
Apache Spark standalone: connecting to MongoDB with Scala using Casbah

Original post: 2015-02-02 19:42:51 | 2 1 | Tags: mongodb, scala, apache-spark, casbah

I would like to run an Apache Spark map-reduce job over 5 files and write the output to MongoDB. I would prefer not to use HDFS, since the NameNode is a single point of failure ( http://wiki.apache.org/hadoop/NameNode ).

A. Is it possible to read multiple files into an RDD, perform a map-reduce on a key shared across all the files, and use the Casbah toolkit to write the results to MongoDB?

B. Is it possible to use the client to read from MongoDB into an RDD, perform a map-reduce, and write the output back to MongoDB using the Casbah toolkit?

C. Is it possible to read multiple files into an RDD, map them with keys that exist in MongoDB, reduce them to a single document, and insert that document back into MongoDB?

I know all of this is possible using the mongo-hadoop connector. I just don't like the idea of using HDFS, since it is a single point of failure and BackupNameNodes are not implemented yet.

I've read some things online, but they are not clear.

MongoDBObject not being added to inside of an rrd foreach loop casbah scala apache spark

Not sure what's going on there. The JSON does not even appear to be valid.

Resources:

https://github.com/mongodb/casbah

http://docs.mongodb.org/ecosystem/drivers/scala/

1 answer

Yes. I haven't used MongoDB, but based on other things I've done with Spark, these should all be quite possible.
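As a rough illustration of case A, here is a minimal sketch, not a tested implementation. It assumes input lines of the form `key,count`, and the file paths, the database name `test`, the collection name `totals`, and the MongoDB host are all hypothetical. It also assumes the job is launched with spark-submit (which supplies the master URL) and, on a standalone cluster without HDFS, that the input files are readable at the same path on every worker.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD functions on Spark 1.2 and earlier
import com.mongodb.casbah.Imports._

object FilesToMongo {
  // Assumed line format: "key,count". A line without a comma throws a MatchError.
  def parseLine(line: String): (String, Long) = {
    val Array(key, count) = line.split(",", 2)
    (key.trim, count.trim.toLong)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("files-to-mongo"))

    // textFile accepts a comma-separated list of paths (globs also work).
    // These paths are placeholders.
    val lines = sc.textFile("file:///data/part1.csv,file:///data/part2.csv")

    // Sum the counts for each key across all input files.
    val totals = lines.map(parseLine).reduceByKey(_ + _)

    totals.foreachPartition { rows =>
      // Open the client on the executor, once per partition: the client
      // cannot be serialized and shipped from the driver.
      val client = MongoClient("localhost", 27017)
      try {
        val coll = client("test")("totals")
        rows.foreach { case (key, total) =>
          // Upsert so that re-running the job overwrites instead of duplicating.
          coll.update(
            MongoDBObject("_id" -> key),
            MongoDBObject("_id" -> key, "total" -> total),
            upsert = true)
        }
      } finally {
        client.close()
      }
    }

    sc.stop()
  }
}
```

Writing inside `foreachPartition` rather than `foreach` keeps the number of MongoDB connections down to one per partition. Cases B and C would follow the same pattern, with the difference that the keys or documents are first fetched from MongoDB.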

However, do keep in mind that a Spark application is not typically fault-tolerant. The application (aka "driver") itself is a single point of failure. There's a related question on that topic ( Resources/Documentation on how does the failover process work for the Spark Driver (and its YARN Container) in yarn-cluster mode ), but I think it doesn't have a really good answer at the moment.

I have no experience running a critical HDFS cluster, so I don't know how much of a problem the single point of failure is. But another idea may be running on top of Amazon S3 or Google Cloud Storage. I would expect these to be way more reliable than anything you can cook up. They have large support teams and lots of money and expertise invested.

Compilation error on MongoDB Casbah for Scala

MongoDB Insert Behavior with Casbah and Scala

Scala: Example of using Casbah to write / update / delete objects in MongoDB?

Insert new record using Scala Salat/Casbah and Mongodb

Scala - Get Last Inserted ObjectId Using Casbah MongoDB

Extracting value from MongoDB DBObject using Scala/Casbah

Repair mongodb stand alone in kubernetes

MongoDBObject not being added to inside of an rrd foreach loop casbah scala apache spark

Casbah Scala MongoDB driver - embedded objects

Multiple documents update mongodb casbah scala




solution1 1 ACCEPTED 2015-02-02 21:28:25
