Identifying linked documents (document tree/lineage) with tidyverse

Question

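I have many text documents (items), each consisting of a unique item number (item_nr) and a text (text).

Through an item_nr mentioned in its text, an item may link to no, one, or several other items.

I have some starting items (start_items), and I want to identify the tree (lineage) of all linked items until each branch ends (at an item that does not link to another item).

Example data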

# library
library(tidyverse)

#example data
start_items=structure(list(item_nr = c("31", "32", "33", "34", "35"), text = c("I do not link", 
                                                                           "I link 16", "I link 26", "I link 99", "I do not know")), row.names = c(NA, 
                                                                                                                                                   -5L), class = c("tbl_df", "tbl", "data.frame"))

items=structure(list(item_nr = c("10", "11", "12", "13", "14", "15", "16", 
                              "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
                              "28", "29", "30"), text = c("I have no link", "I link 12", "hi", 
                                                          "how", "are", "you", "I link 17", "I link 18", "I link 19", "here it ends", 
                                                          "I have no link", "thank", "you", "for", "your", "help", "I link 27 and 28", 
                                                          "yes?", "I link 29", "Me neither", "I link 11")), row.names = c(NA, 
                                                                                                                          -21L), class = c("tbl_df", "tbl", "data.frame"))

# show data
start_items
#> # A tibble: 5 x 2
#>   item_nr text         
#>   <chr>   <chr>        
#> 1 31      I do not link
#> 2 32      I link 16    
#> 3 33      I link 26    
#> 4 34      I link 99    
#> 5 35      I do not know

items
#> # A tibble: 21 x 2
#>    item_nr text          
#>    <chr>   <chr>         
#>  1 10      I have no link
#>  2 11      I link 12     
#>  3 12      hi            
#>  4 13      how           
#>  5 14      are           
#>  6 15      you           
#>  7 16      I link 17     
#>  8 17      I link 18     
#>  9 18      I link 19     
#> 10 19      here it ends  
#> # ... with 11 more rows

What I tried (dplyr approach)

# make links function
func <-function(x){
  tib <- tibble(item_nr=unlist(str_extract_all(x,"1[0-9]|2[0-9]|3[0-9]")))
  if(nrow(tib)<1){return(tibble(item_nr=c(NA_character_),text=c(NA_character_)))}
  left_join(tib,items) -> res
  return(res)
}

# apply function
start_items %>% 
  group_by(item_nr) %>% 
  mutate(link1=list(func(text))) %>% unnest() %>% 
  group_by(item_nr,text1) %>% 
  mutate(link2=list(func(text1))) %>% unnest() %>% 
  group_by(item_nr,text2) %>% 
  mutate(link2=list(func(text2))) %>% unnest() %>% 
  group_by(item_nr,text3) %>% 
  mutate(link2=list(func(text3))) %>% unnest() -> output

# output
output
#> # A tibble: 6 x 10
#> # Groups:   item_nr, text3 [6]
#>   item_nr text    item_nr1 text1   item_nr2 text2 item_nr3 text3 item_nr4 text4 
#>   <chr>   <chr>   <chr>    <chr>   <chr>    <chr> <chr>    <chr> <chr>    <chr> 
#> 1 31      I do n~ <NA>     <NA>    <NA>     <NA>  <NA>     <NA>  <NA>     <NA>  
#> 2 32      I link~ 16       I link~ 17       I li~ 18       I li~ 19       here ~
#> 3 33      I link~ 26       I link~ 27       yes?  <NA>     <NA>  <NA>     <NA>  
#> 4 33      I link~ 26       I link~ 28       I li~ 29       Me n~ <NA>     <NA>  
#> 5 34      I link~ <NA>     <NA>    <NA>     <NA>  <NA>     <NA>  <NA>     <NA>  
#> 6 35      I do n~ <NA>     <NA>    <NA>     <NA>  <NA>     <NA>  <NA>     <NA>

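However, my code feels very clumsy, and it needs a lot of repetition to follow all document trees to their ends in my real data (where I do not know the size of the trees).

Is there a way to write a function that runs until all trees are fully identified?

Thanks for any hints. I am also happy with solutions that lead to a different output format (e.g. a nested structure), if that is more feasible.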

Answer 1

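That was an interesting problem to look into :-)

Your problem is a classic recursion problem, which is a hard concept to grasp the first time you see it.

Since you do not know how many levels of recursion there will be, a long format is the better choice.

Here, the recursive function calls itself as long as there are links left to resolve. The exit condition is based on the number of remaining links. However, I added a max_r value to avoid getting stuck in an infinite loop in case an item links to itself (directly or indirectly).

The initialization step (if(r==0)) only prepares the long format, in which a single item can span multiple rows: each row has a source item, a current item, and a current recursion number. If you would rather not have the function change the format of your dataset, move this step outside to simplify the function (and then start at r=1).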
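As an alternative to capping the depth, a loop can remember which (source item, item) pairs it has already seen and stop following them. The following is only a sketch under assumptions: `follow_links`, `loop_items` and `loop_start` are hypothetical names, the data contain a deliberate cycle (10 links 11, 11 links 10), and, unlike the function in this answer, links to unknown items are dropped (`inner_join`) and an item reached twice from the same source is listed only once.

```r
library(tidyverse)

# Hypothetical data with a cycle: 10 -> 11 -> 10
loop_items <- tibble(item_nr = c("10", "11"),
                     text    = c("I link 11", "I link 10"))
loop_start <- tibble(item_nr = "10", text = "I link 11")

follow_links <- function(start, items, pattern = "[1-3][0-9]") {
  # long format: source item, current item, depth, text
  out <- start %>%
    transmute(source_item = item_nr, item_nr, rec = 0, text)
  frontier <- out
  r <- 0
  while (nrow(frontier) > 0) {
    r <- r + 1
    frontier <- frontier %>%
      mutate(item_nr = str_extract_all(text, pattern), rec = r) %>%
      unnest(item_nr) %>%                    # one row per extracted link
      select(-text) %>%
      inner_join(items, by = "item_nr") %>%  # keep only known items
      anti_join(out, by = c("source_item", "item_nr")) %>%  # skip visited pairs
      distinct(source_item, item_nr, .keep_all = TRUE)
    out <- bind_rows(out, frontier)
  }
  arrange(out, source_item, rec)
}

res <- follow_links(loop_start, loop_items)
res
```

Because every pass either adds at least one unseen pair or empties the frontier, the loop terminates on cyclic data without needing a depth limit.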
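Moving the initialization out of the function, as suggested above, could look roughly like this. It is a sketch rather than a tested replacement: `to_long` and `expand_links` are hypothetical names, `items` is passed as an argument instead of being read from the global environment, and the tiny data set is made up for illustration.

```r
library(tidyverse)

# Made-up miniature data: 32 -> 16 -> 17 (end)
mini_start <- tibble(item_nr = "32", text = "I link 16")
mini_items <- tibble(item_nr = c("16", "17"),
                     text    = c("I link 17", "here it ends"))

# Step 1 (outside the recursion): reshape to the long format
to_long <- function(df) {
  df %>% transmute(source_item = item_nr, item_nr, rec = 0, text)
}

# Step 2: the recursion itself, starting at r = 1
expand_links <- function(df, items, r = 1, max_r = 10) {
  df2 <- df %>%
    filter(rec == r - 1) %>%
    mutate(item_nr = str_extract_all(text, "[1-3][0-9]"), rec = r) %>%
    unnest(item_nr) %>%
    select(-text) %>%
    left_join(items, by = "item_nr")
  # stop when no links remain or the depth cap is exceeded
  if (nrow(df2) == 0 || r > max_r) return(df)
  bind_rows(df, df2) %>%
    arrange(source_item, rec) %>%
    expand_links(items, r = r + 1, max_r = max_r)
}

mini_res <- mini_start %>% to_long() %>% expand_links(mini_items)
mini_res
```

The recursive part then has a single job (adding one more level of links), which makes it easier to read and to test on its own.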

library(tidyverse)
                                                                    
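recursive_func = function(df, r=0, max_r=10){
  # r == 0: reshape to long format (source item, current item, depth, text)
  if(r==0){
    df = df %>% 
      transmute(source_item=item_nr,
                item_nr=item_nr,
                rec=0, 
                text=text)
    return(df %>% recursive_func(r=1, max_r=max_r))
  }
  
  # follow the links found in the rows added at the previous depth
  df2 = df %>% 
    filter(rec==r-1) %>% 
    mutate(item_nr = str_extract_all(text,"[1-3][0-9]"),
           rec=r) %>% 
    unnest(item_nr) %>% 
    left_join(items, 
              by=c("item_nr"), suffix=c("_old", "")) %>% 
    select(-text_old)
  
  # stop when no links remain or the depth cap is exceeded
  if(nrow(df2)==0 || r>max_r){
    return(df)
  }
  
  # pass max_r on, otherwise a custom cap is reset to the default
  bind_rows(df, df2) %>% 
    arrange(source_item, rec) %>% 
    recursive_func(r=r+1, max_r=max_r)
}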


start_items %>% 
  group_by(item_nr) %>%
  recursive_func()
#> # A tibble: 13 x 4
#> # Groups:   item_nr [13]
#>    source_item item_nr   rec text            
#>    <chr>       <chr>   <dbl> <chr>           
#>  1 31          31          0 I do not link   
#>  2 32          32          0 I link 16       
#>  3 32          16          1 I link 17       
#>  4 32          17          2 I link 18       
#>  5 32          18          3 I link 19       
#>  6 32          19          4 here it ends    
#>  7 33          33          0 I link 26       
#>  8 33          26          1 I link 27 and 28
#>  9 33          27          2 yes?            
#> 10 33          28          2 I link 29       
#> 11 33          29          3 Me neither      
#> 12 34          34          0 I link 99       
#> 13 35          35          0 I do not know

Created on 2021-05-05 by the reprex package (v2.0.0)
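If a nested structure is preferred, which the question allows for, the long result can be collapsed to one row per source item. The sketch below rebuilds a few rows of the output shown above by hand so that it runs on its own; `long_res` and `nested` are hypothetical names, and the summary columns are just one possible choice.

```r
library(tidyverse)

# A few rows copied from the long-format output above
long_res <- tribble(
  ~source_item, ~item_nr, ~rec, ~text,
  "31", "31", 0, "I do not link",
  "33", "33", 0, "I link 26",
  "33", "26", 1, "I link 27 and 28",
  "33", "27", 2, "yes?",
  "33", "28", 2, "I link 29",
  "33", "29", 3, "Me neither"
)

# One row per source item, with the linked items kept in a list-column
nested <- long_res %>%
  group_by(source_item) %>%
  summarise(depth    = max(rec),              # deepest level reached
            n_linked = sum(rec > 0),          # number of linked items
            linked   = list(item_nr[rec > 0]),
            .groups  = "drop")
nested
```

The list-column `linked` holds the item numbers reached from each source, and it is empty for sources that link nowhere.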

Answer 1
1 accepted 2021-05-05 13:38:05