I would like to check for duplicates (whether a record already exists in a file): if it does, delete it and write the new version; if it doesn't, just append it. Currently I'm updating the file with new information using the following pattern:
library(dplyr)
library(purrr)
library(readr)

# CREATE FILE
fst <- tibble(id = 1,
              val = rnorm(1),
              val2 = rnorm(1))
readr::write_rds(fst, "example_file.rds")

create_data <- possibly(function(id = 1L) {
  dt_out <- dplyr::tibble(id = id,
                          val = rnorm(1),
                          val2 = rnorm(1))
  # Read the whole file, append the new row, drop duplicate ids, write back
  out <- readr::read_rds("example_file.rds") %>%
    bind_rows(dt_out) %>%
    distinct(id, .keep_all = TRUE)
  readr::write_rds(out, "example_file.rds")
}, otherwise = NA)

links <- c(1, 1, 2, 3, 2, 3, 4, 5)
res <- purrr::map(links, ~ create_data(.x))
read_rds("example_file.rds")
# A tibble: 5 x 3
id val val2
<dbl> <dbl> <dbl>
1 1 0.430 0.636
2 2 -0.348 -0.507
3 3 0.936 -0.343
4 4 0.871 1.59
5 5 -1.06 -0.308
So I have a function that fetches data, and inside it I bind the new data to the old file and check for duplicates. My idea is to write a separate file from each function run and combine them at a later stage, so there would not be one big file but thousands of smaller ones. Also, with the current method I can't really control which instance of a record is kept: distinct()
keeps the first occurrence, and nothing tells me which record comes first.
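As a small standalone sketch of the ordering issue just described (toy data, not the real file): since distinct() keeps the first row it sees per id, binding the new rows before the old ones makes the latest version win.

```r
library(dplyr)

old <- tibble(id = c(1, 2), val = c(10, 20))
new <- tibble(id = 1, val = 99)

# distinct() keeps the first row per id, so put the new data first
updated <- bind_rows(new, old) %>%
  distinct(id, .keep_all = TRUE)

updated
```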
The file is getting too big, and I don't have enough memory to read and write it back and forth when the function runs multiple times. Is there an alternative method where I don't need to read the whole file and can achieve the same result, with only one file being updated with new information?
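A minimal sketch of the one-file-per-record idea mentioned above (the "records" directory name is illustrative): if each id gets its own small .rds file, an update is just an overwrite, so no full read is ever needed, and the combine step happens once at the end.

```r
library(dplyr)
library(purrr)
library(readr)

dir.create("records", showWarnings = FALSE)

# Overwrite the per-id file: the newest write always wins,
# so duplicates can never accumulate.
create_data <- purrr::possibly(function(id = 1L) {
  dt_out <- dplyr::tibble(id = id,
                          val = rnorm(1),
                          val2 = rnorm(1))
  readr::write_rds(dt_out, file.path("records", paste0(id, ".rds")))
}, otherwise = NA)

links <- c(1, 1, 2, 3, 2, 3, 4, 5)
purrr::walk(links, create_data)

# Combine at a later stage: read all the small files into one tibble
combined <- list.files("records", full.names = TRUE) %>%
  purrr::map_dfr(readr::read_rds)
```

This trades one big file for many small ones, which is exactly the trade-off described above; the combine step still reads everything, but only once, not on every update.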
Not exactly what I was looking for, since I like files on disk and hadn't worked with databases before, but this feels pretty good. MongoDB can handle nested data frames, which my data actually contains (hence the .rds format). Installing MongoDB wasn't too bad.
library(mongolite)

example_db <- mongo("example_db", url = "mongodb://127.0.0.1:27017/db_name")

fst <- tibble(id = 1,
              val = rnorm(1),
              val2 = rnorm(1))
example_db$insert(fst)

create_data <- purrr::possibly(function(id = 1L) {
  dt_out <- dplyr::tibble(id = id,
                          val = rnorm(1),
                          val2 = rnorm(1))
  # Remove any existing record with this id, then insert the new one
  example_db$remove(paste0('{"id": {"$in":', jsonlite::toJSON(id), '} }'))
  example_db$insert(dt_out)
}, otherwise = NA)

links <- c(1, 1, 2, 3, 2, 3, 4, 5)
res <- purrr::map(links, ~ create_data(.x))
example_db$find()
id val val2
1 1 0.3772453 -0.4799636
2 2 0.3282423 -0.7768333
3 3 -1.1129543 -1.6095890
4 4 -0.9314038 -1.4073236
5 5 0.4243383 -1.0557676
Of course, the data is not in a file now. This way I'm fairly sure that only the latest version of each id is in the DB, since the old record is removed before the new one is inserted. Also, thanks to $in,
this can be modified to handle multiple ids at once. If a file could work in a similar way, that would be great.
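A sketch of the multi-id variant hinted at above (the function name create_data_batch is made up for illustration, and a running MongoDB instance is assumed): because $in matches any value in a JSON array, the same remove-then-insert pattern works for a whole batch in one round trip.

```r
library(mongolite)
library(dplyr)

example_db <- mongo("example_db", url = "mongodb://127.0.0.1:27017/db_name")

# Hypothetical batch version: remove all rows whose id is in `ids`,
# then insert the replacement tibble in a single call.
create_data_batch <- function(ids) {
  dt_out <- dplyr::tibble(id = ids,
                          val = rnorm(length(ids)),
                          val2 = rnorm(length(ids)))
  # jsonlite::toJSON(c(2, 3, 4)) serialises to the array [2,3,4],
  # which $in matches element-wise
  example_db$remove(paste0('{"id": {"$in": ', jsonlite::toJSON(ids), '} }'))
  example_db$insert(dt_out)
}

create_data_batch(c(2, 3, 4))
```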