Add column to data frame based on long list and values in another column is too slow
I am adding a new column to a data frame using apply() and mutate(). It works, but unfortunately it is very slow. I have 24M rows, and the new column is based on whether values appear in a long list (58 items). It was bearable with a smaller list. Not anymore. Here is my example:
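library(dplyr)  # provides mutate()

large_df <- data.frame(A = 1:4,
                       B = c('a', 'b', 'c', 'd'),
                       C = c('e', 'f', 'g', 'h'))
long_list <- c('e', 'f', 'g')
# Row-wise check: TRUE if any value in columns B or C appears in long_list
large_df <- mutate(large_df, new_C = apply(large_df[, 2:3], 1,
                                           function(r) any(r %in% long_list)))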
The new column (new_C) reads TRUE or FALSE. It works, but I am looking for a faster alternative.
Thank you so much.
Serhiy
Bonus Q: I couldn't select just one column within apply(); I needed a range. Why?
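On the bonus question, a likely explanation, assuming large_df is a plain data.frame (not a tibble): selecting a single column with `[, j]` drops the result to a plain vector, which has no dimensions, so apply() has no rows to iterate over. The sketch below shows this on the example data; the names one_col, kept, via_apply and direct are only illustrative.

```r
large_df <- data.frame(A = 1:4,
                       B = c('a', 'b', 'c', 'd'),
                       C = c('e', 'f', 'g', 'h'))
long_list <- c('e', 'f', 'g')

# A single column selected with [, j] is returned as a vector, not a data frame
one_col <- large_df[, 3]
is.null(dim(one_col))   # TRUE: no dim attribute, so apply(one_col, 1, ...) fails

# drop = FALSE keeps a one-column data frame, which apply() accepts
kept <- large_df[, 3, drop = FALSE]
via_apply <- apply(kept, 1, function(r) any(r %in% long_list))

# For a single column, %in% is already vectorized, so apply() is not needed
direct <- large_df$C %in% long_list
direct                  # TRUE TRUE TRUE FALSE
```

For a single column the direct `%in%` form is probably the better choice, since it avoids calling a function once per row.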
Try one of these alternatives using lapply:
large_df$new_c <- Reduce(`|`, lapply(large_df[, 2:3], `%in%`, long_list))
or sapply:
large_df$new_c <- rowSums(sapply(large_df[, 2:3], `%in%`, long_list)) > 0
Both return:
large_df
# A B C new_c
#1 1 a e TRUE
#2 2 b f TRUE
#3 3 c g TRUE
#4 4 d h FALSE
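The reason these should be faster is that apply() calls the anonymous function once per row (24M calls in the question), while the column-wise versions call `%in%` once per column and combine whole vectors. As a quick sanity check, here is a small sketch confirming that both alternatives agree with the original row-wise approach on the example data. The variable names (cols, original, via_reduce, via_rowsums) are only for illustration.

```r
large_df <- data.frame(A = 1:4,
                       B = c('a', 'b', 'c', 'd'),
                       C = c('e', 'f', 'g', 'h'))
long_list <- c('e', 'f', 'g')
cols <- c("B", "C")   # columns to check against long_list

# Original approach: one function call per row
original <- apply(large_df[, cols], 1, function(r) any(r %in% long_list))

# Column-wise: one %in% call per column, combined element-wise with |
via_reduce <- Reduce(`|`, lapply(large_df[, cols], `%in%`, long_list))

# Column-wise: logical matrix of matches, then count matches per row
via_rowsums <- rowSums(sapply(large_df[, cols], `%in%`, long_list)) > 0

# All three should agree; stopifnot() errors if they do not
stopifnot(all(original == via_reduce), all(via_reduce == via_rowsums))
```

To compare speed on data closer to the real size, each expression can be wrapped in system.time(). Note that the rowSums() variant relies on sapply() returning a matrix, so it assumes the data frame has more than one row.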