
Bulk insert/update using R to mongoDB

I am trying to insert a lot of data (millions of documents) into MongoDB using R, from a variety of data frames which I will obtain at different times.

Each data frame will have the same primary id, but can have the same or different attributes.

If the record exists, I would like to add any new attributes and append to any existing ones. If the record doesn't exist, I would like to create it.

Is this possible in R efficiently? I have tried to use the wonderful mongolite package, but the insert option fails because duplicate records already exist.

Any pointers greatly appreciated.

Thanks

Iain

id<-LETTERS[1:5]
value1<-paste0("value1_",letters[1:5])
value2<-paste0("value2_",letters[1:5])
value3<-paste0("additional_value_1",letters[1:5])
df1<-as.data.frame(cbind(id,value1))
df2<-as.data.frame(cbind(id,value2))
df3<-as.data.frame(cbind(id,value3))

colnames(df1)<-c('_id','value1')
colnames(df2)<-c('_id','value2')
colnames(df3)<-c('_id','value1')

desired_value1<-paste0( "[",paste(paste0("'",value1,"'"),paste0("'",value3,"'"),sep=","),"]")
df4<-as.data.frame(cbind(id,desired_value1,value2))
colnames(df4)<-c("_id","value1","value2")

Note

This answer uses library(rmongodb), which is no longer supported and is no longer on CRAN.


This answer will partly depend on how you're getting your 'new' data.frames. I also can't answer the 'efficiently' part without knowing your setup and the size of your data, but hopefully this will get you started. Plus, I've found that inserting millions of records into mongo from R, or retrieving them, is quite slow.

One way of doing this: for every new data.frame you get, bring the matching records back into R and 'join/update' them, then update the database for just those documents, adding any new records in the same pass with an update / upsert query.

I also use library(rmongodb) for most of my R/MongoDB work.

Slightly modifying your data to use id instead of _id:

id<-LETTERS[1:5]
value1<-paste0("value1_",letters[1:5])
value2<-paste0("value2_",letters[1:5])
value3<-paste0("additional_value_1",letters[1:5])
df1<-as.data.frame(cbind(id,value1), stringsAsFactors = F)  ## removed factor levels
df2<-as.data.frame(cbind(id,value2), stringsAsFactors = F)
df3<-as.data.frame(cbind(id,value3), stringsAsFactors = F)

colnames(df1)<-c('id','value1')
colnames(df2)<-c('id','value2')
colnames(df3)<-c('id','value1')

desired_value1<-paste0( "[",paste(paste0("'",value1,"'"),paste0("'",value3,"'"),sep=","),"]")
df4<-as.data.frame(cbind(id,desired_value1,value2))
colnames(df4)<-c("_id","value1","value2")

The first step is to insert df1 into the database

library(rmongodb)  ## my preferred r mongodb package 
library(jsonlite)  ## for viewing/checking results 
library(data.table) ## for fast rbind & data frame manipulation

mongo <- mongo.create()
mongo.is.connected(mongo)
# [1] TRUE

db <- "test"
coll <- "test"

bs <- mongo.bson.from.df(df1)
ns <- paste0(db, ".", coll)

## insert.batch - insert each 'row' of the df as a document
mongo.insert.batch(mongo = mongo, 
                   ns = ns, 
                   lst = bs)  
# [1] TRUE

Retrieve all documents to check the upload

f <- mongo.bson.from.list(list("_id" = 0))  ## to ignore the _id field
res <- mongo.find.all(mongo = mongo, ns = ns, fields = f)
toJSON(res, pretty=T)
# [
#   {
#     "id": ["A"],
#     "value1": ["value1_a"]
#   },
#   {
#     "id": ["B"],
#     "value1": ["value1_b"]
#   },
#   {
#     "id": ["C"],
#     "value1": ["value1_c"]
#   },
#   {
#     "id": ["D"],
#     "value1": ["value1_d"]
#   },
#   {
#     "id": ["E"],
#     "value1": ["value1_e"]
#   }
# ] 
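A side note on reading that output: the square brackets around each value most likely come from how jsonlite prints length-one R vectors, not from arrays stored in the database. A minimal sketch, using a hand-built list (an assumption standing in for what mongo.find.all returns) so it runs without a server:

```r
library(jsonlite)

# Hand-built stand-in for one returned document (assumed shape)
doc <- list(list(id = "A", value1 = "value1_a"))

# Default: every length-one vector is printed as a one-element array
boxed <- as.character(toJSON(doc))

# auto_unbox = TRUE prints length-one vectors as scalars
unboxed <- as.character(toJSON(doc, auto_unbox = TRUE))

print(boxed)
print(unboxed)
```

Passing auto_unbox = TRUE to the toJSON calls in this answer should give output closer to what the mongo shell shows.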

Now, if we want to add our df2$value2 to those documents, we can bring them back into R to manipulate them, then update the database

qry <- list("id" = list("$in" = df2$id))
## mongo shell query: db.test.find({"id" : { "$in" : ["A", "B", ..., ]}})
qry <- mongo.bson.from.list(qry)
f <- list("_id" = 0)
res <- mongo.find.all(mongo = mongo, 
                      ns = ns,
                      query = qry,
                      fields = f)

dt_res <- rbindlist(res)

## set our df2 to data.table, and join onto dt_res
setDT(df2)

## add a new row to df2, with a new id, to check the update.upsert works
df2 <- rbindlist(list(df2, data.table(id = "Z", value2 = "value2_z")))

dt_res <- dt_res[ df2, on="id"]  ## keeps every row of df2, including our new 'Z' row
dt_res
#    id   value1   value2
# 1:  A value1_a value2_a
# 2:  B value1_b value2_b
# 3:  C value1_c value2_c
# 4:  D value1_d value2_d
# 5:  E value1_e value2_e
# 6:  Z       NA value2_z

We can now update the database with these new values using update and upsert

for(i in 1:nrow(dt_res)){

  crit <- mongo.bson.from.list(list("id" = dt_res[i, id]))
  d <- c(dt_res[i, ])
  mongo.update(mongo = mongo, 
               ns = ns, 
               criteria = crit, 
               objNew = d, 
               flags = c(mongo.update.upsert))    
}
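Since rmongodb is no longer on CRAN, here is a rough sketch of how the same per-row upsert could be prepared for mongolite instead. It builds one query and one `$set` update per row, and skips NA fields so they are not written to the document. build_upsert is a hypothetical helper (not part of either package), the small data frame stands in for dt_res, and the connection details in the commented-out lines are assumptions; the mongolite calls are left commented so the block runs without a server.

```r
library(jsonlite)

# Hypothetical helper: build the query/update JSON pair for one row,
# dropping NA fields so they are not written to the document.
# Edge case not handled: a row where every non-key field is NA.
build_upsert <- function(row, key = "id") {
  row <- as.list(row)
  fields <- row[setdiff(names(row), key)]
  fields <- fields[!vapply(fields, function(x) is.na(x), logical(1))]
  list(
    query  = as.character(toJSON(row[key], auto_unbox = TRUE)),
    update = as.character(toJSON(list(`$set` = fields), auto_unbox = TRUE))
  )
}

# Stand-in for the joined dt_res (assumed data)
dt <- data.frame(id     = c("A", "Z"),
                 value1 = c("value1_a", NA),
                 value2 = c("value2_a", "value2_z"),
                 stringsAsFactors = FALSE)

ops <- lapply(seq_len(nrow(dt)), function(i) build_upsert(dt[i, ]))

# With a running MongoDB (assumed to be local), each pair could be sent as:
# con <- mongolite::mongo(collection = "test", db = "test")
# for (op in ops) con$update(op$query, op$update, upsert = TRUE)
```

Using `$set` means only the listed fields are touched, so attributes already in the document but missing from the new data frame are left alone.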

Check the documents have been updated by returning everything

f <- mongo.bson.from.list(list("_id" = 0))  ## to ignore the _id field
res <- mongo.find.all(mongo = mongo, ns = ns, fields = f)
toJSON(res, pretty=T)

# [
#   {
#     "id": ["A"],
#     "value1": ["value1_a"],
#     "value2": ["value2_a"]
#   },
#   {
#     "id": ["B"],
#     "value1": ["value1_b"],
#     "value2": ["value2_b"]
#   },
#   {
#     "id": ["C"],
#     "value1": ["value1_c"],
#     "value2": ["value2_c"]
#   },
#   {
#     "id": ["D"],
#     "value1": ["value1_d"],
#     "value2": ["value2_d"]
#   },
#   {
#     "id": ["E"],
#     "value1": ["value1_e"],
#     "value2": ["value2_e"]
#   },
#   {
#     "id": ["Z"],
#     "value1": {},
#     "value2": ["value2_z"]
#   }
# ] 

Note this includes our new 'Z' id
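The same fetch-and-join pattern can probably cover the 'append' case from the question too, where a new data frame (like df3) carries more values for an attribute that already exists. A minimal base R sketch, assuming `existing` is what came back from the database and `incoming` is the new data frame (both made up here for illustration):

```r
# Assumed data: records already in the database, and a new data frame
existing <- data.frame(id = c("A", "B"),
                       value1 = c("value1_a", "value1_b"),
                       stringsAsFactors = FALSE)
incoming <- data.frame(id = c("A", "B", "Z"),
                       value1 = c("additional_value_1a",
                                  "additional_value_1b",
                                  "additional_value_1z"),
                       stringsAsFactors = FALSE)

# Keep every incoming row; ids with no existing record get NA for the old value
merged <- merge(existing, incoming, by = "id", all.y = TRUE,
                suffixes = c("_old", "_new"))

# Per id, combine old and new values into one vector, dropping NAs
combined <- Map(function(old, new) {
  v <- c(old, new)
  v[!is.na(v)]
}, merged$value1_old, merged$value1_new)
names(combined) <- merged$id

str(combined)
```

Each element of combined could then be written back as the new value1 for that id in an upsert. Ids with a single value come out as a length-one vector, so whether they are stored as a scalar or a one-element array depends on how the update document is serialised.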
