簡體   English   中英

如何將MongoDB數據上創建的org.apache.spark.sql.DataFrame中的數據保存回MongoDB?

[英]How to save data from org.apache.spark.sql.DataFrame created on MongoDB data back to MongoDB?

有一些方法可以將org.apache.spark.sql.DataFrame數據保存到文件系統或Hive。 但是如何將MongoDB數據上創建的DataFrame數據保存回MongoDB?

編輯 :我使用創建了DataFrame

SparkContext sc = new SparkContext()
Configuration config = new Configuration();
config.set("mongo.input.uri","mongodb://localhost:27017:testDB.testCollection);
JavaRDD<Tuple2<Object, BSONObject>> mongoJavaRDD = sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class,
            BSONObject.class).toJavaRDD();
JavaRDD<Object> mongoRDD = mongoJavaRDD.flatMap(new FlatMapFunction<Tuple2<Object, BSONObject>, Object>()
    {
        @Override
        public Iterable<Object> call(Tuple2<Object, BSONObject> arg)
        {
            BSONObject obj = arg._2();
            Object javaObject = generateJavaObjectFromBSON(obj, clazz);
            return Arrays.asList(javaObject);
        }
    });

sqlContext = new SqlContext(sc);
 DataFrame df = sqlContext.createDataFrame(mongoRDD, Person.class).registerTempTable("Person");

使用PySpark並假設您有一個本地MongoDB實例:

import pymongo
from toolz import dissoc

# First, lets create some dummy collection
client = pymongo.MongoClient()
client["foo"]["bar"].insert([{"k": "foo", "v": 1}, {"k": "bar", "v": 2}])
client.close()

config = {
    "mongo.input.uri": "mongodb://localhost:27017/foo.bar",
    "mongo.output.uri": "mongodb://localhost:27017/foo.barplus"
}

# Read data from MongoDB
rdd = sc.newAPIHadoopRDD(
    "com.mongodb.hadoop.MongoInputFormat",
     "org.apache.hadoop.io.Text",
     "org.apache.hadoop.io.MapWritable",
     None, None, config)

# Drop _id field and create data frame
dt = sqlContext.createDataFrame(rdd.map(lambda (k, v): dissoc(v, "_id")))
dt_plus_one = dt.select(dt["k"], (dt["v"] + 1).alias("v"))

(dt_plus_one.
   rdd. # Extract rdd
   map(lambda row: (None, row.asDict())). # Map to (None, dict) pairs
   saveAsNewAPIHadoopFile(
       "file:///placeholder", # Ignored
       # From org.mongodb.mongo-hadoop:mongo-hadoop-core
       "com.mongodb.hadoop.MongoOutputFormat", 
        None, None, None, None, config))

另請參閱:使Spark,Python和MongoDB協同工作

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM