Speedy test on R data frame to see if row values in one column are inside another column in the data frame
I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest.
Here's a link with the dput output of a sample of the data frame: http://dpaste.com/2SJ6DPX
Please let me know if there's a better way of sharing this data.
All I want to do is create an additional binary keep column, which should be 1 if the row's FO.variable value is inside its variable value and 0 otherwise.
Seems like a simple thing... in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past hours trying to get this in R and failing.
Here's what I've tried:
Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to combine grepl with ifelse, which resulted in grepl using the first value in the column instead of the entire column.
My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there... the code ran for a while, then errored because of an invalid subscript.
transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])
My next attempt was to use str_detect, but I don't think this is the right approach because I want the row-level value, and I think 'any' will literally use any value in the vector?
kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))
EDIT: Just tried a for loop. I would prefer a vectorized approach, but I'm pretty desperate at this point. I haven't used for loops before, as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right, and I'm not sure if I screwed up the syntax:
for(i in 1:nrow(dd)){ if(dd[i,4] %in% dd[i,2]) dd$test[i] <- 1 }
As I mentioned, my ideal output is an additional column with 1 if FO.variable is inside variable and 0 otherwise. For example, the first three records in the sample data would be 1, and the 4th record would be zero since "Direct/Unknown" is not within "Organic Search, System Email".
A bonus would be if a solution could run fast. The apply options were taking a long, long time, perhaps because they were looping over every iteration across both columns?
This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how to best approach this.
I would go with a simple mapply in your case; as you correctly said, by-row operations will be very slow. Also (as suggested by Martin), setting fixed = TRUE and converting to character beforehand will significantly improve performance.
transform(dd, Keep = mapply(grepl,
as.character(FO.variable),
as.character(variable),
fixed = TRUE))
# VisitorIDTrue variable value FO.variable FO.value Keep
# 22 44888657 Direct / Unknown,Organic Search 1 Direct / Unknown 1 TRUE
# 2 44888657 Direct / Unknown,System Email 1 Direct / Unknown 1 TRUE
# 6 44888657 Direct / Unknown,TV 1 Direct / Unknown 1 TRUE
# 10 44888657 Organic Search,System Email 1 Direct / Unknown 1 FALSE
# 18 44888657 Organic Search,TV 1 Direct / Unknown 1 FALSE
# 14 44888657 System Email,TV 1 Direct / Unknown 1 FALSE
# 24 44888657 Direct / Unknown,Organic Search 1 Organic Search 1 TRUE
# 4 44888657 Direct / Unknown,System Email 1 Organic Search 1 FALSE
...
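To make the row-wise pairing concrete, here is a minimal sketch on a made-up three-row data frame (toy and its values are invented for illustration and are not the asker's data). grepl takes a single pattern, which is consistent with the asker's observation that only the first value of the column was used; mapply instead pairs the i-th pattern with the i-th string.

```r
# Toy data, invented for illustration (not the asker's data)
toy <- data.frame(
  variable    = c("Direct / Unknown,TV",
                  "Organic Search,System Email",
                  "System Email,TV"),
  FO.variable = c("Direct / Unknown", "Direct / Unknown", "TV"),
  stringsAsFactors = FALSE
)

# Row-wise test: is FO.variable[i] a literal substring of variable[i]?
# fixed = TRUE is passed to every grepl call via MoreArgs.
toy$Keep <- mapply(grepl, toy$FO.variable, toy$variable,
                   MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# 1/0 version, if a numeric flag is preferred over TRUE/FALSE
toy$Keep01 <- as.integer(toy$Keep)
```

Rows 1 and 3 are flagged here; row 2 is not, because "Direct / Unknown" does not occur in "Organic Search,System Email".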
I read the data
df = dget("http://dpaste.com/2SJ6DPX.txt")
then split the 'variable' column into its parts and figured out the lengths of each entry
v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v) ## sapply(v, length) in R-3.1.3
Then I unlisted v and created an index that maps each element of the unlisted v to the row it came from
uv = unlist(v)
idx = rep(seq_along(v), len)
Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable
test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
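The same steps can be traced on a tiny example. This is only a sketch; the toy data frame and its single-letter values are invented so that every intermediate vector can be read off by eye.

```r
# Toy data, invented for illustration
toy <- data.frame(
  variable    = c("A,B", "B,C", "A,C"),
  FO.variable = c("A", "A", "C"),
  stringsAsFactors = FALSE
)

v   <- strsplit(toy$variable, ",", fixed = TRUE)  # list: ("A","B"), ("B","C"), ("A","C")
uv  <- unlist(v)                                  # "A" "B" "B" "C" "A" "C"
idx <- rep(seq_along(v), lengths(v))              # 1 1 2 2 3 3  (row of origin)

# Compare each part with the FO.variable of the row it came from
test <- uv == toy$FO.variable[idx]                # TRUE FALSE FALSE FALSE FALSE TRUE

toy$Keep <- FALSE
toy$Keep[idx[test]] <- TRUE                       # rows 1 and 3 become TRUE
```

All the work happens in a handful of vectorized calls over the flattened parts, with no per-row function call.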
Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd))
f0 = function(dd) {
v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
len = lengths(v)
uv = unlist(v)
idx = rep(seq_along(v), len)
keep = logical(nrow(dd))
keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
keep
}
(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)
f1 = function(dd)
mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)
f1a = function(dd)
mapply(grepl, as.character(dd$FO.variable),
as.character(dd$variable), fixed=TRUE)
f2 = function(dd)
apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))
with
> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
expr min lq mean median uq max neval
f0(df) 57.559 64.6940 70.26804 69.4455 74.1035 98.322 100
f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183 100
f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115 100
f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704 100
Two subtle but important additions during the development of the timings were to use fixed=TRUE in the pattern matching (so the pattern is treated as a literal string rather than a regular expression), and to coerce the factors to character.
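A small sketch of what fixed = TRUE changes, using made-up strings. It also shows one way the substring test (grepl) and the exact-part test (strsplit plus comparison) could disagree. The channel name "Paid TV" is hypothetical; whether such overlapping names occur in the real data is an assumption.

```r
# Made-up strings for illustration
regex_hit   <- grepl("a.b", "axb")                # regex: "." matches any character
literal_hit <- grepl("a.b", "axb", fixed = TRUE)  # literal: looks for the text "a.b"

# Substring vs. exact-part matching on a hypothetical channel list
x <- "Paid TV,System Email"
sub_hit  <- grepl("TV", x, fixed = TRUE)                   # "TV" occurs inside "Paid TV"
part_hit <- "TV" %in% strsplit(x, ",", fixed = TRUE)[[1]]  # no part is exactly "TV"
```

Here regex_hit and sub_hit are TRUE while literal_hit and part_hit are FALSE, so the choice between the two styles of test matters only if one channel name can be contained in another.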
Here is a data.table approach that I think is very similar in spirit to Martin's:
require(data.table)
dt <- data.table(df)
dt[,`:=`(
fch = as.character(FO.variable),
rn = 1:.N
)]
dt[,keep:=FALSE]
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
dt[dtvars,keep:=TRUE]
dt[,c("fch","rn"):=NULL]
The idea is to match the rn & variable pairs (saved in dtvars) against the rn & FO.variable pairs (in the original table, dt).
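For readers without data.table, the same pair-matching idea can be sketched in base R with merge(). This is not the code from the answer above: the toy data frame and the names pairs, target and hits are invented for illustration.

```r
# Toy data, invented for illustration
toy <- data.frame(
  variable    = c("A,B", "B,C", "A,C"),
  FO.variable = c("A", "A", "C"),
  stringsAsFactors = FALSE
)

# One row per (row number, part of variable) pair
parts <- strsplit(toy$variable, ",", fixed = TRUE)
pairs <- data.frame(rn   = rep(seq_along(parts), lengths(parts)),
                    part = unlist(parts),
                    stringsAsFactors = FALSE)

# One row per (row number, FO.variable) pair
target <- data.frame(rn   = seq_len(nrow(toy)),
                     part = toy$FO.variable,
                     stringsAsFactors = FALSE)

# Inner join: only pairs present in both tables survive
hits <- merge(target, pairs, by = c("rn", "part"))

# A row is kept if its (rn, FO.variable) pair found a match
toy$Keep <- seq_len(nrow(toy)) %in% hits$rn
```

The keyed join in data.table does the same matching, but updates the keep column in place instead of building an intermediate result.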