Is there a more efficient way to replace NULL with NA in a list?
I quite often come across data that is structured something like this:
employees <- list(
  list(id = 1,
       dept = "IT",
       age = 29,
       sportsteam = "softball"),
  list(id = 2,
       dept = "IT",
       age = 30,
       sportsteam = NULL),
  list(id = 3,
       dept = "IT",
       age = 29,
       sportsteam = "hockey"),
  list(id = 4,
       dept = NULL,
       age = 29,
       sportsteam = "softball"))
In many cases such lists could be tens of millions of items long, so memory use and efficiency are always a concern.
I would like to turn the list into a data frame, but if I run:
library(data.table)
employee.df <- rbindlist(employees)
I get errors because of the NULL values. My normal strategy is to use a function like:
nullToNA <- function(x) {
  x[sapply(x, is.null)] <- NA
  return(x)
}
and then:
employees <- lapply(employees, nullToNA)
employee.df <- rbindlist(employees)
which returns
   id dept age sportsteam
1:  1   IT  29   softball
2:  2   IT  30         NA
3:  3   IT  29     hockey
4:  4   NA  29   softball
However, the nullToNA function is very slow when applied to 10 million cases, so it would be good to have a more efficient approach.
One thing that seems to slow the process down is that the is.null function can only be applied to one item at a time (unlike is.na, which can scan a full list in one go).
Any advice on how to do this operation efficiently on a large dataset?
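On the point about is.null being applied one element at a time: as a rough sketch only, base R's lengths() reports the length of every list element in a single call, and NULL has length 0. The helper name nullToNA_lengths below is hypothetical, and the approach assumes that no field legitimately holds a zero-length value such as character(0), since those would be replaced as well.

```r
# Hypothetical helper (illustrative name, not from the original post).
nullToNA_lengths <- function(x) {
  # lengths() gives the length of each element in one vectorised call;
  # NULL elements have length 0. Caveat: character(0) etc. also match.
  x[lengths(x) == 0L] <- NA
  x
}

record <- list(id = 2, dept = "IT", age = 30, sportsteam = NULL)
cleaned <- nullToNA_lengths(record)
str(cleaned)
```

It can be dropped into the existing lapply(employees, ...) call in place of nullToNA. Whether it is actually faster on 10 million records would need to be measured.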
Many efficiency problems in R are solved by first changing the original data into a form that makes the processes that follow as fast and easy as possible. Usually, this is matrix form.
If you bring all the data together with rbind, your nullToNA function no longer has to search through nested lists, and sapply can do its job (scanning a single matrix) more efficiently. In theory, this should make the process faster.
Good question, by the way.
> dat <- do.call(rbind, lapply(employees, rbind))
> dat
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NULL
[3,] 3  "IT" 29  "hockey"
[4,] 4  NULL 29  "softball"
> nullToNA(dat)
     id dept age sportsteam
[1,] 1  "IT" 29  "softball"
[2,] 2  "IT" 30  NA
[3,] 3  "IT" 29  "hockey"
[4,] 4  NA   29  "softball"
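One thing worth checking with this approach is what kind of object rbind produces. The sketch below rebuilds the sample data and shows that the result is a matrix whose cells are list elements, not an atomic matrix. It then replaces the NULL cells in one pass using lengths(), which is my own substitution for the nullToNA call above and assumes no cell is a legitimate zero-length value.

```r
employees <- list(
  list(id = 1, dept = "IT", age = 29, sportsteam = "softball"),
  list(id = 2, dept = "IT", age = 30, sportsteam = NULL),
  list(id = 3, dept = "IT", age = 29, sportsteam = "hockey"),
  list(id = 4, dept = NULL, age = 29, sportsteam = "softball"))

# Each record becomes a 1-row list-matrix; stacking them gives a 4 x 4 one.
dat <- do.call(rbind, lapply(employees, rbind))
is.list(dat)  # the cells are list elements, so the matrix has mode "list"
dim(dat)

# Assumption: a cell of length 0 is a NULL that should become NA.
dat[lengths(dat) == 0L] <- NA
dat
```

Because the cells are still list elements, the columns need a further conversion step before they behave like ordinary numeric or character vectors.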
A two-step approach creates a data frame after combining the records with rbind:
employee.df<-data.frame(do.call("rbind",employees))
Now replace the NULLs. I compare against the string "NULL" because R does not keep a real NULL when the data is loaded this way; it reads the value as character.
employee.df.withNA <- sapply(employee.df, function(x) ifelse(x == "NULL", NA, x))
A tidyverse solution that I find easier to read is to write a function that works on a single element and map it over all of your NULLs.
I'll use @rich-scriven's rbind and lapply approach to create a matrix, and then turn that into a data frame.
library(magrittr)
dat <- do.call(rbind, lapply(employees, rbind)) %>%
  as.data.frame()
dat
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30       NULL
#> 3  3   IT  29     hockey
#> 4  4 NULL  29   softball
Then we can use purrr::modify_depth() at a depth of 2 to apply replace_x():
replace_x <- function(x, replacement = NA_character_) {
  if (length(x) == 0 || length(x[[1]]) == 0) {
    replacement
  } else {
    x
  }
}
out <- dat %>%
  purrr::modify_depth(2, replace_x)
out
#>   id dept age sportsteam
#> 1  1   IT  29   softball
#> 2  2   IT  30         NA
#> 3  3   IT  29     hockey
#> 4  4   NA  29   softball
All of these solutions (I think) hide the fact that the data table is still a list of lists and not a list of vectors (I did not notice this in my own application either, until it started throwing unexpected errors during :=). Try this:
data.table(t(sapply(employees, function(x) unlist(lapply(x, function(x) ifelse(is.null(x),NA,x))))))
I believe it works fine, but I am not sure whether it will be slow or whether it can be optimized further.
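A related point: unlist() on a record that mixes numbers and strings coerces everything to character, so the one-liner above yields character columns. As a hedged alternative sketch in base R, the columns can be built one field at a time so that each keeps its own type. This assumes every record has the same field names as the first one and that each field holds a single value; the names fields and cols are mine.

```r
employees <- list(
  list(id = 1, dept = "IT", age = 29, sportsteam = "softball"),
  list(id = 2, dept = "IT", age = 30, sportsteam = NULL),
  list(id = 3, dept = "IT", age = 29, sportsteam = "hockey"),
  list(id = 4, dept = NULL, age = 29, sportsteam = "softball"))

# Assumption: the first record names every field that can occur.
fields <- names(employees[[1]])

cols <- lapply(fields, function(f) {
  # Pull one field out of every record, substituting NA for NULL.
  vals <- lapply(employees, function(e) if (is.null(e[[f]])) NA else e[[f]])
  # Each field is unlisted on its own, so numbers stay numeric.
  unlist(vals)
})
names(cols) <- fields

employee.df <- as.data.frame(cols, stringsAsFactors = FALSE)
str(employee.df)
```

The result has plain atomic columns, which is what operations such as := expect. I have not benchmarked this on large inputs.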
I often find do.call() calls hard to read. Here is a solution I use daily (with MySQL output containing "NULL" character values):
NULL2NA <- function(df) {
  df[, 1:length(df)][df[, 1:length(df)] == 'NULL'] <- NA
  return(df)
}
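As a usage sketch, here is the same idea applied to a small made-up data frame standing in for the MySQL output. The sample data and the shortened function body are my own illustration, and the function is renamed NULL2NA_short to mark it as a variation.

```r
# Variation on the function above: df == "NULL" already covers all columns.
NULL2NA_short <- function(df) {
  df[df == "NULL"] <- NA
  df
}

# Hypothetical sample standing in for a query result with "NULL" strings.
raw <- data.frame(id = c(1, 2, 3),
                  dept = c("IT", "NULL", "IT"),
                  sportsteam = c("softball", "hockey", "NULL"),
                  stringsAsFactors = FALSE)

clean <- NULL2NA_short(raw)
clean
```

Note that this only touches cells holding the literal string "NULL"; it does not convert column types.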
But for all solutions: please remember that NA cannot be used in calculations without na.rm = TRUE, whereas NULL can, because c() simply drops it. NaN causes the same problem. For example:
> mean(c(1, 2, 3))
[1] 2
> mean(c(1, 2, NA, 3))
[1] NA
> mean(c(1, 2, NULL, 3))
[1] 2
> mean(c(1, 2, NaN, 3))
[1] NaN
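To make the difference concrete, this short sketch shows why the NULL case works (the NULL never makes it into the vector) and how na.rm = TRUE handles both NA and NaN. The variable names are only for illustration.

```r
# c() drops NULL, so the vector only ever has three elements.
x_null <- c(1, 2, NULL, 3)
length(x_null)            # 3

# NA and NaN stay in the vector and propagate through mean()
# unless they are removed explicitly.
x_na  <- c(1, 2, NA, 3)
x_nan <- c(1, 2, NaN, 3)
mean(x_na, na.rm = TRUE)   # 2
mean(x_nan, na.rm = TRUE)  # 2
```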