
How do I iterate over two consecutive indexed objects using purrr::map()?

I have more than 100 CSV files, all with the same structure. Each CSV file is a daily snapshot of the metadata of all documents in a system. The filenames contain the snapshot date. The metadata contains Document_ID, Status, Author and some other columns. Each row represents one document's metadata.

I would like to create a log of all changes over time. So I first load all the files into a single tibble, using:

df <- fs::dir_ls(path = "Files") %>% 
  purrr::map_dfr(read_csv, .id = "Filename")

The original filename containing the snapshot-date is now in the first column. Here's a simplified reprex of the resulting df:

library(tidyverse)

df <- tibble(Filename = c(rep("File_2020-09-27", 2), rep("File_2020-09-28", 3), rep("File_2020-09-29", 4), rep("File_2020-09-30", 5)),
             Doc_ID = c(seq(1, 2), seq(1, 3), seq(1, 4), seq(1, 5)),
             Status = c("Finished", "Started", 
                        "Finished", "Started", "Started", 
                        "Finished", "Started", "Finished", "Started",
                        "Finished", "Waiting", "Finished", "Started", "Started"),
             Author = c("John", "John",
                        "John", "Mike", "John",
                        "John", "Mike", "John", "Mike",
                        "John", "Mike", "John", "Mike", "Betty"),
             Other_column = rnorm(14))
df
#> # A tibble: 14 x 5
#>    Filename        Doc_ID Status   Author Other_column
#>    <chr>            <int> <chr>    <chr>         <dbl>
#>  1 File_2020-09-27      1 Finished John          0.319
#>  2 File_2020-09-27      2 Started  John          0.633
#>  3 File_2020-09-28      1 Finished John          2.27 
#>  4 File_2020-09-28      2 Started  Mike          0.302
#>  5 File_2020-09-28      3 Started  John          0.905
#>  6 File_2020-09-29      1 Finished John          0.451
#>  7 File_2020-09-29      2 Started  Mike          1.46 
#>  8 File_2020-09-29      3 Finished John          0.306
#>  9 File_2020-09-29      4 Started  Mike         -0.850
#> 10 File_2020-09-30      1 Finished John         -2.03 
#> 11 File_2020-09-30      2 Waiting  Mike          0.250
#> 12 File_2020-09-30      3 Finished John          0.637
#> 13 File_2020-09-30      4 Started  Mike         -0.207
#> 14 File_2020-09-30      5 Started  Betty        -2.13

Created on 2020-10-02 by the reprex package (v0.3.0)

Note that documents never disappear; they only change their status or author. To manually create the desired output, I first create separate tibbles for every daily snapshot:

Docs_1 <- df %>% filter(Filename == "File_2020-09-27")
Docs_2 <- df %>% filter(Filename == "File_2020-09-28")
Docs_3 <- df %>% filter(Filename == "File_2020-09-29")
Docs_4 <- df %>% filter(Filename == "File_2020-09-30")

For every consecutive pair of daily snapshots, I then identify the rows of the later day that are new or different from the previous day; these are the only rows I'm interested in. "New" or "different" refers to the combination of Doc_ID, Status and Author:

Changes_1_2 <- Docs_2 %>% dplyr::anti_join(Docs_1, by = c("Doc_ID", "Status", "Author"))

resulting in:

# A tibble: 2 x 5
  Filename        Doc_ID Status  Author Other_column
  <chr>            <int> <chr>   <chr>         <dbl>
1 File_2020-09-28      2 Started Mike          0.807
2 File_2020-09-28      3 Started John          0.336
Changes_2_3 <- Docs_3 %>% dplyr::anti_join(Docs_2, by = c("Doc_ID", "Status", "Author"))

resulting in:

# A tibble: 2 x 5
  Filename        Doc_ID Status   Author Other_column
  <chr>            <int> <chr>    <chr>         <dbl>
1 File_2020-09-29      3 Finished John         1.48  
2 File_2020-09-29      4 Started  Mike        -0.0407
Changes_3_4 <- Docs_4 %>% dplyr::anti_join(Docs_3, by = c("Doc_ID", "Status", "Author"))

resulting in:

# A tibble: 2 x 5
  Filename        Doc_ID Status  Author Other_column
  <chr>            <int> <chr>   <chr>         <dbl>
1 File_2020-09-30      2 Waiting Mike         -0.267
2 File_2020-09-30      5 Started Betty        -1.36 

Finally, I bind all changes together to get a log of all changes in a single tibble:

Changelog <- dplyr::bind_rows(Changes_1_2, Changes_2_3, Changes_3_4)

resulting in:

# A tibble: 6 x 5
  Filename        Doc_ID Status   Author Other_column
  <chr>            <int> <chr>    <chr>         <dbl>
1 File_2020-09-28      2 Started  Mike         0.807 
2 File_2020-09-28      3 Started  John         0.336 
3 File_2020-09-29      3 Finished John         1.48  
4 File_2020-09-29      4 Started  Mike        -0.0407
5 File_2020-09-30      2 Waiting  Mike        -0.267 
6 File_2020-09-30      5 Started  Betty       -1.36  

For every Doc_ID I can then analyze the changes of their metadata over time in the Changelog.

Given the sheer number of files and entries, I need a more elegant way to create the Changelog. How can I implement this procedure with iteration, preferably using purrr::map() from the tidyverse? My problem is that every iteration targets two consecutive indexed objects, and I have found no example of this anywhere. I'm thinking of something like this (obviously this code doesn't work; I'm just inventing my own notation for illustration):

Changelog <- df %>% split(.$Date) %>% 
  purrr::map_dfr(df_index+1 %>% dplyr::anti_join(df_index, by = c("Doc_ID", "Status", "Author")))

Does anybody know how to solve this problem? Maybe I should load the CSV files into a list in the first place, instead of loading them into a single tibble.
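For what it's worth, the pairwise pattern sketched above can also be written directly with purrr::map2(), by offsetting a split list against itself. A sketch (using a two-snapshot mini version of the df from the reprex to keep it self-contained):

```r
library(dplyr)
library(purrr)

# Mini version of the df from the reprex (two snapshots are enough
# to show the pairing).
df <- tibble(
  Filename = c(rep("File_2020-09-27", 2), rep("File_2020-09-28", 3)),
  Doc_ID   = c(1:2, 1:3),
  Status   = c("Finished", "Started", "Finished", "Started", "Started"),
  Author   = c("John", "John", "John", "Mike", "John")
)

# Split into one tibble per snapshot; the names sort chronologically
# because the dates are ISO-formatted (YYYY-MM-DD).
snapshots <- split(df, df$Filename)

# Offset the list against itself so element i pairs with element i - 1,
# then anti_join each later day against its predecessor and row-bind.
Changelog <- map2_dfr(
  snapshots[-1],                  # days 2..n
  snapshots[-length(snapshots)],  # days 1..(n-1)
  ~ anti_join(.x, .y, by = c("Doc_ID", "Status", "Author"))
)
Changelog
```

Here this keeps only Doc_ID 2 (author changed) and Doc_ID 3 (new) from the second snapshot. The nest/lag answer below achieves the same pairing inside a single pipeline.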

I think we can group/nest, lag, and compare:

library(dplyr)
library(tidyr) # unnest
set.seed(42) # and then your `df <- tibble(...)`

df %>%
  nest_by(Filename) %>%
  ungroup() %>%
  mutate(lastdata = lag(data)) %>%
  filter(lengths(lastdata) > 0) %>%
  mutate(
    diffs = purrr::map2(data, lastdata, ~ anti_join(.x, .y, by = c("Doc_ID", "Status", "Author")))
  ) %>%
  select(-data, -lastdata) %>%
  tidyr::unnest(diffs)
# # A tibble: 6 x 5
#   Filename        Doc_ID Status   Author Other_column
#   <chr>            <int> <chr>    <chr>         <dbl>
# 1 File_2020-09-28      2 Started  Mike         0.633 
# 2 File_2020-09-28      3 Started  John         0.404 
# 3 File_2020-09-29      3 Finished John        -0.0947
# 4 File_2020-09-29      4 Started  Mike         2.02  
# 5 File_2020-09-30      2 Waiting  Mike         1.30  
# 6 File_2020-09-30      5 Started  Betty       -0.279 

The important steps to look at are:

  1. Initial grouped/nested setup:

     df %>%
       nest_by(Filename) %>%
       ungroup() %>%
       mutate(lastdata = lag(data))
     # # A tibble: 4 x 3
     #   Filename        data               lastdata          
     #   <chr>           <list<tbl_df[,4]>> <list<tbl_df[,4]>>
     # 1 File_2020-09-27 [2 x 4]            [0]               
     # 2 File_2020-09-28 [3 x 4]            [2 x 4]           
     # 3 File_2020-09-29 [4 x 4]            [3 x 4]           
     # 4 File_2020-09-30 [5 x 4]            [4 x 4]

    where the data column effectively contains your Docs_1 through Docs_4, and lastdata contains the previous element of data. This means that for the second row, data contains the rows from 09-28 and lastdata contains the rows from 09-27.

  2. Since we can't compare 09-27 to a previous day (and its lastdata is empty), we filter it out:

     filter(lengths(lastdata) > 0)
  3. Finally, we iterate along the two list columns in parallel, anti_joining each pair of data and lastdata:

     mutate(
       diffs = purrr::map2(data, lastdata, ~ anti_join(.x, .y, by = c("Doc_ID", "Status", "Author")))
     )
     # # A tibble: 3 x 4
     #   Filename        data               lastdata           diffs           
     #   <chr>           <list<tbl_df[,4]>> <list<tbl_df[,4]>> <list>          
     # 1 File_2020-09-28 [3 x 4]            [2 x 4]            <tibble [2 x 4]>
     # 2 File_2020-09-29 [4 x 4]            [3 x 4]            <tibble [2 x 4]>
     # 3 File_2020-09-30 [5 x 4]            [4 x 4]            <tibble [2 x 4]>
  4. Clean up by removing the list columns and unnesting diffs.
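Since the question mentions analyzing each Doc_ID's changes over time, it may also help to parse the snapshot date out of Filename. A sketch, assuming the File_YYYY-MM-DD pattern from the reprex and that the unnested result of the pipeline above has been assigned to Changelog (a tiny stand-in with the same columns is used here):

```r
library(dplyr)

# Stand-in for the unnested changelog produced by the pipeline above.
Changelog <- tibble(
  Filename = c("File_2020-09-30", "File_2020-09-28"),
  Doc_ID   = c(5L, 2L),
  Status   = c("Started", "Started"),
  Author   = c("Betty", "Mike")
)

# Strip the "File_" prefix and parse the remainder as a Date, so the
# changes can be ordered per document over time.
Changelog <- Changelog %>%
  mutate(Date = as.Date(sub("^File_", "", Filename))) %>%
  arrange(Doc_ID, Date)
```

With a real Date column, group_by(Doc_ID) then works naturally for downstream per-document analysis.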
