R: Efficient way to update files

I would like to check for duplicates (i.e. whether a record already exists in a file): if it does, delete it and write the new version; if it doesn't, just append it. Currently I'm updating new information into a file with the following pattern:

library(dplyr)
library(purrr)
library(readr)

# CREATE FILE
fst <- tibble(id = 1,
              val = rnorm(1),
              val2 = rnorm(1))

readr::write_rds(fst, "example_file.rds")

create_data <- possibly(function(id = 1L){
  
  # New record for this id
  dt_out <- dplyr::tibble(id = id,
                          val = rnorm(1),
                          val2 = rnorm(1))
  
  # Read the whole file, append the new record, drop duplicate ids
  out <- readr::read_rds("example_file.rds") %>% 
    bind_rows(dt_out) %>% 
    distinct(id, .keep_all = TRUE)
  
  readr::write_rds(out, "example_file.rds")
  
}, otherwise = NA)

links <- c(1,1,2,3,2,3,4,5)

res <- purrr::map(links, ~create_data(.x))

read_rds("example_file.rds")
# A tibble: 5 x 3
     id    val   val2
  <dbl>  <dbl>  <dbl>
1     1  0.430  0.636
2     2 -0.348 -0.507
3     3  0.936 -0.343
4     4  0.871  1.59 
5     5 -1.06  -0.308

So I have a function that fetches data, and inside it I bind the new data to the old file and check for duplicates. One idea is to write a single file from each function run and combine them at a later stage, so instead of one big file there would be thousands of smaller ones. Also, with the current method I can't really control which instance of a record is kept: I think distinct() keeps the first record, and I don't have anything to tell me which one is first (a sketch of one way to make this explicit follows below).
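For instance, here is a minimal sketch (my addition, not part of the original post) of making the choice explicit inside create_data() by stamping each row with its creation time; the ts column is a hypothetical extra, and this assumes the file has carried ts from the start:

dt_out <- dplyr::tibble(id = id,
                        val = rnorm(1),
                        val2 = rnorm(1),
                        ts = Sys.time())  # hypothetical timestamp column

out <- readr::read_rds("example_file.rds") %>% 
  bind_rows(dt_out) %>% 
  arrange(desc(ts)) %>%                 # newest version of each id first
  distinct(id, .keep_all = TRUE)        # distinct() keeps the first row per id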

The file is getting too big, and I don't have enough memory to read and write it back and forth when the function runs many times. Is there an alternative method where I don't need to read the whole file but can achieve the same result, with only one file being updated with the new information?
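(A sketch of my own, not from the original question.) The "thousands of smaller files" idea above already avoids re-reading everything: keep one .rds file per id, so an update rewrites only that id's file and nothing else is read. The records/ directory and per-id file names are assumptions:

library(dplyr)
library(purrr)
library(readr)

dir.create("records", showWarnings = FALSE)

create_data <- possibly(function(id = 1L){
  
  dt_out <- tibble(id = id,
                   val = rnorm(1),
                   val2 = rnorm(1))
  
  # Overwriting the per-id file acts as an upsert: the latest write wins
  readr::write_rds(dt_out, file.path("records", paste0(id, ".rds")))
  
}, otherwise = NA)

links <- c(1,1,2,3,2,3,4,5)
res <- purrr::map(links, ~create_data(.x))

# Combine the small files only when the full table is actually needed
combined <- list.files("records", full.names = TRUE) %>% 
  purrr::map(readr::read_rds) %>% 
  bind_rows()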

Not exactly what I was looking for, as I like having files on my computer and haven't worked with databases before, but this feels pretty good. MongoDB can handle the nested data frames which I actually have in my data (hence the .rds format). Installing MongoDB wasn't too bad.

library(mongolite)

# Connect to (or create) the collection "example_db" in database "db_name"
example_db <- mongo("example_db", url = "mongodb://127.0.0.1:27017/db_name")

fst <- dplyr::tibble(id = 1,
                     val = rnorm(1),
                     val2 = rnorm(1))

example_db$insert(fst)

create_data <- purrr::possibly(function(id = 1L){
  
  dt_out <- dplyr::tibble(id = id,
                          val = rnorm(1),
                          val2 = rnorm(1))
  
  # Remove any existing record(s) with this id, then insert the new one
  example_db$remove(paste0('{"id": {"$in": ', jsonlite::toJSON(id), '}}'))
  example_db$insert(dt_out)
  
}, otherwise = NA)

links <- c(1,1,2,3,2,3,4,5)
res <- purrr::map(links, ~create_data(.x))

(example_db$find())
  id        val       val2
1  1  0.3772453 -0.4799636
2  2  0.3282423 -0.7768333
3  3 -1.1129543 -1.6095890
4  4 -0.9314038 -1.4073236
5  5  0.4243383 -1.0557676

Of course, the data is not in a file any more. With this approach I'm fairly sure that only the latest version of each id is in the DB, as the old record is removed and then inserted again. Also, thanks to $in, this can be modified to handle multiple ids at once (see the sketch below). If a file could work in a similar way, that would be great.
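For example, a minimal sketch of that batched variant (my own assumption, not code from the answer):

create_data_multi <- purrr::possibly(function(ids){
  
  dt_out <- dplyr::tibble(id = ids,
                          val = rnorm(length(ids)),
                          val2 = rnorm(length(ids)))
  
  # One $remove and one $insert for the whole vector of ids
  example_db$remove(paste0('{"id": {"$in": ', jsonlite::toJSON(ids), '}}'))
  example_db$insert(dt_out)
  
}, otherwise = NA)

create_data_multi(c(4, 5, 6))

mongolite's $update() method also accepts upsert = TRUE, which could replace the remove-then-insert pair when updating one document at a time.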
