How to recode <NULL> cells to nested NA (<lgl [1]>) in a tibble's list-column?

In a tibble with list-columns, how can I replace <NULL> entries with a nested NA (which is displayed as <lgl [1]>)?

library(tibble)

tbl_with_null <-
  tibble(letter =  letters[1:10],
       value_1 = list(1, 2, 4, data.frame(a = 1, 2, 3), NULL, 6, 7, c(8, 11, 25), NULL, 10),
       value_2 = list("A", "B", "C", "D", NULL, NULL, NULL, list("H", "B", list(data.frame(id = 1:3))), "I", "J"))

> tbl_with_null
 
## # A tibble: 10 x 3
##    letter value_1          value_2   
##    <chr>  <list>           <list>    
##  1 a      <dbl [1]>        <chr [1]> 
##  2 b      <dbl [1]>        <chr [1]> 
##  3 c      <dbl [1]>        <chr [1]> 
##  4 d      <df[,3] [1 x 3]> <chr [1]> 
##  5 e      <NULL>           <NULL>    
##  6 f      <dbl [1]>        <NULL>    
##  7 g      <dbl [1]>        <NULL>    
##  8 h      <dbl [3]>        <list [3]>
##  9 i      <NULL>           <chr [1]> 
## 10 j      <dbl [1]>        <chr [1]> 

Is there a way to act on the entire tbl_with_null and replace every <NULL> with NA, to get:

## # A tibble: 10 x 3
##    letter value_1                 value_2   
##    <chr>  <list>                  <list>    
##  1 a      <dbl [1]>               <chr [1]> 
##  2 b      <dbl [1]>               <chr [1]> 
##  3 c      <dbl [1]>               <chr [1]> 
##  4 d      <df[,3] [1 x 3]>        <chr [1]> 
##  5 e      <lgl [1]> <- NA         <lgl [1]>  # <- NA
##  6 f      <dbl [1]>               <lgl [1]>  # <- NA
##  7 g      <dbl [1]>               <lgl [1]>  # <- NA
##  8 h      <dbl [3]>               <list [3]>
##  9 i      <lgl [1]> <- NA         <chr [1]> 
## 10 j      <dbl [1]>               <chr [1]> 

UPDATE


I made some progress based on this solution:

library(dplyr)  # provides %>%, mutate() and across()

tbl_with_null %>%
  mutate(across(c(value_1, value_2), ~replace(., !lengths(.), list(NA))))

## # A tibble: 10 x 3
##    letter value_1          value_2   
##    <chr>  <list>           <list>    
##  1 a      <dbl [1]>        <chr [1]> 
##  2 b      <dbl [1]>        <chr [1]> 
##  3 c      <dbl [1]>        <chr [1]> 
##  4 d      <df[,3] [1 x 3]> <chr [1]> 
##  5 e      <lgl [1]>        <lgl [1]> 
##  6 f      <dbl [1]>        <lgl [1]> 
##  7 g      <dbl [1]>        <lgl [1]> 
##  8 h      <dbl [3]>        <list [3]>
##  9 i      <lgl [1]>        <chr [1]> 
## 10 j      <dbl [1]>        <chr [1]> 
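
A note on the `!lengths(.)` idiom used above: it flags every element of length zero, not only `NULL`. The following base R sketch (the list `x` is made up for illustration and is not part of the original data) contrasts it with an explicit `is.null` check. For the example tibble both give the same answer, so this only matters if a list-column may also hold empty vectors.

```r
# Illustrative list: a number, a NULL, an empty character vector, a longer vector
x <- list(1, NULL, character(0), c(8, 11, 25))

# TRUE for every element whose length is 0 (NULL and character(0) alike)
by_length <- !lengths(x)

# TRUE only for elements that are actually NULL
by_null <- vapply(x, is.null, logical(1))

by_length  # FALSE  TRUE  TRUE FALSE
by_null    # FALSE  TRUE FALSE FALSE
```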

However, this is insufficient, because I'm looking for a solution that blindly replaces NULL with NA across the entire data frame. And if we go with mutate(across(everything(), ~replace(., !lengths(.), list(NA)))), the letter column becomes a list-column too, which is unintended.

## # A tibble: 10 x 3
##    letter    value_1          value_2   
##    <list>    <list>           <list>    
##  1 <chr [1]> <dbl [1]>        <chr [1]> 
##  2 <chr [1]> <dbl [1]>        <chr [1]> 
##  3 <chr [1]> <dbl [1]>        <chr [1]> 
##  4 <chr [1]> <df[,3] [1 x 3]> <chr [1]> 
##  5 <chr [1]> <lgl [1]>        <lgl [1]> 
##  6 <chr [1]> <dbl [1]>        <lgl [1]> 
##  7 <chr [1]> <dbl [1]>        <lgl [1]> 
##  8 <chr [1]> <dbl [3]>        <list [3]>
##  9 <chr [1]> <lgl [1]>        <chr [1]> 
## 10 <chr [1]> <dbl [1]>        <chr [1]> 
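
A plausible explanation for the character column turning into a list-column is that base `replace()` assigns `list(NA)` into the vector, and assigning a list value coerces the whole vector to a list even when no position is selected. The sketch below reproduces this outside of `mutate()`; the vector `letter` here is a stand-in for the tibble column, not the original data.

```r
# Stand-in for an ordinary character column
letter <- c("a", "b", "c")

# No element has length 0, so nothing is selected for replacement
no_match <- !lengths(letter)

# The replacement value is a list, so the result is coerced to a list anyway
result <- replace(letter, no_match, list(NA))

is.character(letter)  # TRUE
is.list(result)       # TRUE
```

This is consistent with the output above, where every `letter` cell shows up as `<chr [1]>` inside a `<list>` column.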

UPDATE 2


I thought I had it solved with

mutate(across(everything(), ~simplify(replace(., !lengths(.), list(NA)))))

But unfortunately this fails in some cases, such as with this data:

tbl_with_no_null <-
  tbl_with_null %>%
  slice(8) %>%
  select(letter, value_1)

## # A tibble: 1 x 2
##   letter value_1  
##   <chr>  <list>   
## 1 h      <dbl [3]>

While I was expecting that

tbl_with_no_null %>%
  mutate(across(everything(), ~simplify(replace(., !lengths(.), list(NA)))))

would return the same tbl_with_no_null (because there is no <NULL> to replace):

## # A tibble: 1 x 2
##   letter value_1  
##   <chr>  <list>   
## 1 h      <dbl [3]>

But instead I got the error:

Error: Problem with `mutate()` input `..1`.
x Input `..1` can't be recycled to size 1.
i Input `..1` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
i Input `..1` must be size 1, not 3.
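
One way to read this error, assuming `simplify()` flattens a list of atomic vectors much like base `unlist()` does (an assumption made for illustration, not something stated in the error message): the single list cell holding three numbers is flattened into a length-3 vector, which no longer fits a one-row tibble. A base R sketch of that mismatch:

```r
# Stand-in for the value_1 column of the one-row tibble: one cell, three numbers
col <- list(c(8, 11, 25))
length(col)    # 1, one element per row

# Flattening drops the list wrapper and exposes all three numbers
flat <- unlist(col)
length(flat)   # 3, which cannot be recycled to a single row
```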

Bottom line

I'm looking for a way to replace <NULL> with NA in list-columns and, naturally, if there is no <NULL> to replace, to return the input as-is.

base::rapply doesn't recurse through NULL, but you could use rrapply, which allows this and is quite efficient:

library(rrapply)
rrapply::rrapply(tbl_with_null, function(x) NA, how = "replace", condition = is.null)

# A tibble: 10 x 3
   letter value_1          value_2   
   <chr>  <list>           <list>    
 1 a      <dbl [1]>        <chr [1]> 
 2 b      <dbl [1]>        <chr [1]> 
 3 c      <dbl [1]>        <chr [1]> 
 4 d      <df[,3] [1 x 3]> <chr [1]> 
 5 e      <lgl [1]>        <lgl [1]> 
 6 f      <dbl [1]>        <lgl [1]> 
 7 g      <dbl [1]>        <lgl [1]> 
 8 h      <dbl [3]>        <list [3]>
 9 i      <lgl [1]>        <chr [1]> 
10 j      <dbl [1]>        <chr [1]> 

Or, as suggested by @JorisC. in the comments, use the classes argument, which seems to be up to 25% faster on large lists:

rrapply(tbl_with_null, classes = "NULL", how = "replace", f = function(x) NA)

And just for fun:

eval(parse(text=gsub("NULL","NA",capture.output(dput(tbl_with_null)))))

# A tibble: 10 x 3
   letter value_1          value_2   
   <chr>  <list>           <list>    
 1 a      <dbl [1]>        <chr [1]> 
 2 b      <dbl [1]>        <chr [1]> 
 3 c      <dbl [1]>        <chr [1]> 
 4 d      <df[,3] [1 x 3]> <chr [1]> 
 5 e      <lgl [1]>        <lgl [1]> 
 6 f      <dbl [1]>        <lgl [1]> 
 7 g      <dbl [1]>        <lgl [1]> 
 8 h      <dbl [3]>        <list [3]>
 9 i      <lgl [1]>        <chr [1]> 
10 j      <dbl [1]>        <chr [1]> 

fortunes::fortune(106)

# If the answer is parse() you should usually rethink the question.
#   -- Thomas Lumley
#      R-help (February 2005)
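
Beyond the quote, there is a practical reason to keep the parse approach in the "just for fun" category: `gsub()` works on the deparsed text, so it also rewrites any string data that happens to contain "NULL". A minimal sketch with made-up data (the list `x` is illustrative only):

```r
# Made-up data: a string that contains "NULL" and a genuine NULL entry
x <- list(label = "NULL pointer", value = NULL)

# Same trick as above: deparse, substitute in the text, parse back
result <- eval(parse(text = gsub("NULL", "NA", capture.output(dput(x)))))

result$label  # "NA pointer", the string data was changed too
result$value  # NA
```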

The speed comparison is surprising; I would have expected parse to be the slowest solution:

microbenchmark::microbenchmark(
  rrapply = rrapply::rrapply(tbl_with_null, function(x) NA, how = "replace", condition = is.null),
  parse = eval(parse(text=gsub("NULL","NA",capture.output(dput(tbl_with_null))))),
  dplyr = mutate(tbl_with_null,across(where(is.list), .fns = map_if, .p = is.null, .f = function(x) NA)))
Unit: microseconds
    expr      min       lq       mean    median        uq      max neval cld
 rrapply   25.401   31.801   60.92102   51.2510   58.3010 1053.502   100 a  
   parse  225.001  269.701  327.31600  329.1005  362.4505  687.800   100  b 
   dplyr 2942.501 3207.301 3604.63105 3500.0005 3766.1510 6541.402   100   c

I would suggest the following approach.

# packages
library(tibble)
library(purrr)
library(dplyr)

# data
tbl_with_null <-
  tibble(
    letter = letters[1:10],
    value_1 = list(1, 2, 4, data.frame(a = 1, 2, 3), NULL, 6, 7, c(8, 11, 25), NULL, 10),
    value_2 = list("A", "B", "C", "D", NULL, NULL, NULL, list("H", "B", list(data.frame(id = 1:3))), "I", "J")
  )

# replace all NULL in list format with NA
tbl_with_null %>% 
  mutate(across(where(is.list), .fns = map_if, .p = is.null, .f = function(x) NA))
#> # A tibble: 10 x 3
#>    letter value_1          value_2   
#>    <chr>  <list>           <list>    
#>  1 a      <dbl [1]>        <chr [1]> 
#>  2 b      <dbl [1]>        <chr [1]> 
#>  3 c      <dbl [1]>        <chr [1]> 
#>  4 d      <df[,3] [1 x 3]> <chr [1]> 
#>  5 e      <lgl [1]>        <lgl [1]> 
#>  6 f      <dbl [1]>        <lgl [1]> 
#>  7 g      <dbl [1]>        <lgl [1]> 
#>  8 h      <dbl [3]>        <list [3]>
#>  9 i      <lgl [1]>        <chr [1]> 
#> 10 j      <dbl [1]>        <chr [1]>

# slice 
tbl_with_null %>% 
  slice(8) %>% 
  mutate(across(where(is.list), .fns = map_if, .p = is.null, .f = function(x) NA))
#> # A tibble: 1 x 3
#>   letter value_1   value_2   
#>   <chr>  <list>    <list>    
#> 1 h      <dbl [3]> <list [3]>

Created on 2021-03-14 by the reprex package (v1.0.0)

Check the help pages of the corresponding functions for more details (or add a comment here!)

You were very close to solving it. If you only want to replace NULLs within nested columns, rather than applying the mutate to everything, just apply it to the columns whose values are typed as lists, using where(is.list) instead of everything(), as agila showed above. While you can keep the simplify, it doesn't seem to be necessary in my testing.

library(tidyverse)

tbl_with_null <-
  tibble(letter =  letters[1:10],
         value_1 = list(1, 2, 4, data.frame(a = 1, 2, 3), NULL, 6, 7, c(8, 11, 25), NULL, 10),
         value_2 = list("A", "B", "C", "D", NULL, NULL, NULL, list("H", "B", list(data.frame(id = 1:3))), "I", "J"))

tbl_with_null %>% 
  mutate(across(where(is.list), ~replace(., !lengths(.), list(NA))))

This solution is marginally faster than agila's on my computer while sticking with the tidyverse, though if you're willing to use an additional package, rrapply is clearly the faster solution.

  > microbenchmark::microbenchmark(
+   rrapply = rrapply::rrapply(tbl_with_null, function(x) NA, how = "replace", condition = is.null),
+   parse = eval(parse(text=gsub("NULL","NA",capture.output(dput(tbl_with_null))))),
+   dplyr1 = mutate(tbl_with_null,across(where(is.list), .fns = map_if, .p = is.null, .f = function(x) NA)),
+   dplyr2 = mutate(tbl_with_null, across(where(is.list), ~simplify(replace(., !lengths(.), list(NA))))),
+   dplyr3 = mutate(tbl_with_null, across(where(is.list), ~replace(., !lengths(.), list(NA))))
+ )  
Unit: microseconds
    expr      min        lq       mean    median       uq      max neval
 rrapply   27.795   42.4015   49.85706   45.9475   49.935  508.133   100
   parse  354.237  371.6450  400.97961  391.9885  425.434  598.792   100
  dplyr1 2472.218 2526.7575 2625.90951 2578.0390 2667.312 3086.635   100
  dplyr2 2270.130 2338.4955 2529.54983 2380.3345 2491.390 7513.478   100
  dplyr3 2243.784 2291.5100 2525.00431 2346.0720 2439.517 7318.504   100
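
For readers who would rather avoid extra packages altogether, the same top-level replacement can be sketched in base R. This is not one of the benchmarked solutions; `replace_null_with_na` is a hypothetical helper name, and like the dplyr variants it only touches top-level cells of list-columns (it does not recurse into nested lists the way rrapply does).

```r
# Hypothetical helper (not from the answers above): replace top-level NULL
# cells with NA in every list-column and leave all other columns untouched.
replace_null_with_na <- function(df) {
  for (col in names(df)) {
    if (is.list(df[[col]])) {
      is_null <- vapply(df[[col]], is.null, logical(1))
      df[[col]][is_null] <- list(NA)
    }
  }
  df
}

# Small made-up data frame with one character column and one list-column
df <- data.frame(letter = c("a", "b", "c"))
df$value <- list(1, NULL, c(8, 11, 25))

out <- replace_null_with_na(df)

# Running it again on data without NULLs should leave the cells as they are
again <- replace_null_with_na(out)
```

Because only columns for which `is.list()` is TRUE are modified, the `letter` column stays a plain character vector.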

Here are some data.table based solutions. rrapply on a data.table is slightly faster, while the more traditional lapply approach is slower:


library(data.table)
library(rrapply)
library(microbenchmark)

dt <- as.data.table( tbl_with_null )
dt.worker <- function(x) {
    if( identical( x, list(NULL) ) )
        return(list(NA))
    return(x)
}

dt[, lapply( .SD, dt.worker ), by = letter ]

rrapply( dt, function(x) NA, how = "replace", condition = is.null)

microbenchmark(
    rrapply = rrapply::rrapply(tbl_with_null, function(x) NA, how = "replace", condition = is.null),
    parse = eval(parse(text=gsub("NULL","NA",capture.output(dput(tbl_with_null))))),
    dplyr1 = mutate(tbl_with_null,across(where(is.list), .fns = map_if, .p = is.null, .f = function(x) NA)),
    dplyr2 = mutate(tbl_with_null, across(where(is.list), ~simplify(replace(., !lengths(.), list(NA))))),
    dplyr3 = mutate(tbl_with_null, across(where(is.list), ~replace(., !lengths(.), list(NA)))),
    dt.lapply = dt[, lapply( .SD, dt.worker ), by = letter ],
    dt.rrapply = rrapply( dt, function(x) NA, how = "replace", condition = is.null)
)


Unit: microseconds
       expr      min        lq       mean    median        uq      max neval  cld
    rrapply   22.592   28.2730   37.91673   35.0210   36.3885  460.414   100 a   
      parse  213.831  242.7650  255.37595  254.2365  267.8920  308.278   100  b  
     dplyr1 1986.615 2028.5695 2197.87663 2061.2655 2082.5410 8258.728   100    d
     dplyr2 1803.212 1836.4240 1934.95871 1861.9965 1895.8655 8053.553   100   c 
     dplyr3 1779.537 1814.3925 1848.84501 1835.6575 1866.9810 2203.042   100   c 
  dt.lapply  287.349  321.2775  349.15118  338.7005  377.2070  446.948   100  b  
 dt.rrapply   16.962   26.1245   32.82651   29.5205   32.3605  425.738   100 a   

Running the dplyr::mutate solutions on a data.table seems to be slightly faster than on their tibble equivalents, but they are still, as expected, orders of magnitude slower.
