Speedy test on R data frame to see if row values in one column are inside another column in the data frame

I have a data frame of marketing data with 22k records and 6 columns, 2 of which are of interest:

  • variable
  • FO.variable

Here's a link to the dput output of a sample of the data frame: http://dpaste.com/2SJ6DPX

Please let me know if there's a better way of sharing this data.

All I want to do is create an additional binary keep column, which should be:

  • 1 if FO.variable is inside variable
  • 0 if FO.variable is not inside variable

Seems like a simple thing... in Excel I would just add another column with an "if" formula and then paste the formula down. I've spent the past few hours trying to get this working in R and failing.

Here's what I've tried:

  1. Using grepl for pattern matching. I've used grepl before, but this time I'm trying to pass a column instead of a string. My early attempts failed because I tried to combine grepl and ifelse, which resulted in grepl using the first value in the column instead of the entire thing.

  2. My next attempt was to use transform and grep, based on another post on SO. I didn't think this would give me my exact answer, but I figured it would get me close enough to figure it out from there... the code ran for a while, then errored because of an invalid subscript.

    transform(dd, Keep = FO.variable[sapply(variable, grep, FO.variable)])

  3. My next attempt was to use str_detect, but I don't think this is the right approach because I want the row-level value, and I think 'any' will literally use any value in the vector?

    kk <- sapply(dd$variable, function(x) any(sapply(dd$FO.variable, str_detect, string = x)))

  4. EDIT: Just tried a for loop. I would prefer a vectorized approach, but I'm pretty desperate at this point. I haven't used for loops before, as I've avoided them and stuck to other solutions. It doesn't seem to be working quite right; not sure if I screwed up the syntax:

    for (i in 1:nrow(dd)) {
      if (dd[i, 4] %in% dd[i, 2]) dd$test[i] <- 1
    }

As I mentioned, my ideal output is an additional column with 1 or 0 depending on whether FO.variable is inside variable. For example, the first three records in the sample data would be 1 and the 4th record would be 0, since "Direct / Unknown" is not within "Organic Search,System Email".

A bonus would be if the solution could run fast. The apply options were taking a long, long time, perhaps because they were looping over every combination of values across both columns?

This turned out to be not nearly as simple as I would have thought. Or maybe it is and I'm just a dunce. Either way, I appreciate any help on how best to approach this.

I would go with a simple mapply in your case; as you correctly said, by-row operations will be very slow. Also (as suggested by Martin), setting fixed = TRUE and converting to character beforehand will significantly improve performance.

transform(dd, Keep = mapply(grepl, 
                            as.character(FO.variable), 
                            as.character(variable), 
                            fixed = TRUE))

#    VisitorIDTrue                        variable value      FO.variable FO.value  Keep
# 22      44888657 Direct / Unknown,Organic Search     1 Direct / Unknown        1  TRUE
# 2       44888657   Direct / Unknown,System Email     1 Direct / Unknown        1  TRUE
# 6       44888657             Direct / Unknown,TV     1 Direct / Unknown        1  TRUE
# 10      44888657     Organic Search,System Email     1 Direct / Unknown        1 FALSE
# 18      44888657               Organic Search,TV     1 Direct / Unknown        1 FALSE
# 14      44888657                 System Email,TV     1 Direct / Unknown        1 FALSE
# 24      44888657 Direct / Unknown,Organic Search     1   Organic Search        1  TRUE
# 4       44888657   Direct / Unknown,System Email     1   Organic Search        1 FALSE
...
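
As a minimal sketch of what the mapply call is doing, here is the same idea on a two-row toy data frame. The column names follow the question, but the values are made up for illustration, and the conversion with as.integer is only an assumption about how one might get the 1/0 column the question asks for.

```r
# Toy data: column names follow the question, values are invented.
dd <- data.frame(
  variable    = c("Direct / Unknown,TV", "Organic Search,System Email"),
  FO.variable = c("Direct / Unknown", "Direct / Unknown"),
  stringsAsFactors = FALSE
)

# mapply() pairs the columns element by element, so row i is tested as
# grepl(pattern = FO.variable[i], x = variable[i], fixed = TRUE).
keep <- mapply(grepl, dd$FO.variable, dd$variable,
               fixed = TRUE, USE.NAMES = FALSE)

# The question asks for 1/0 rather than TRUE/FALSE.
dd$Keep <- as.integer(keep)
dd$Keep
```

Because each pattern is paired only with the value in its own row, nothing is compared across rows, which is what made the nested sapply attempt so slow.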

I read the data

df = dget("http://dpaste.com/2SJ6DPX.txt")

then split the 'variable' column into its parts and figured out the length of each entry

v = strsplit(as.character(df$variable), ",", fixed=TRUE)
len = lengths(v)    ## sapply(v, length) in R-3.1.3

Then I unlisted v and created an index that maps each element of the unlisted v to the row it came from

uv = unlist(v)
idx = rep(seq_along(v), len)

Finally, I found the indexes for which uv was equal to its corresponding entry in FO.variable

test = (uv == as.character(df$FO.variable)[idx])
df$Keep = FALSE
df$Keep[ idx[test] ] = TRUE
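
To make the intermediate vectors concrete, here is a sketch of the same steps on a three-row toy data frame. The values "A", "B" and "C" are placeholders rather than real channels, and the comments show what each object holds at that point.

```r
# Toy data with placeholder values (not from the real sample).
df <- data.frame(
  variable    = c("A,B", "B,C", "C"),
  FO.variable = c("A", "A", "C"),
  stringsAsFactors = FALSE
)

v   <- strsplit(df$variable, ",", fixed = TRUE)  # list: c("A","B"), c("B","C"), "C"
len <- lengths(v)                                # 2 2 1
uv  <- unlist(v)                                 # "A" "B" "B" "C" "C"
idx <- rep(seq_along(v), len)                    # 1 1 2 2 3 (source row of each part)

# Compare every part with the FO.variable of the row it came from.
test <- uv == df$FO.variable[idx]                # TRUE FALSE FALSE FALSE TRUE

keep <- logical(nrow(df))                        # FALSE FALSE FALSE
keep[idx[test]] <- TRUE                          # rows 1 and 3 have a matching part
keep
```

All the work is done by a handful of vectorised calls on the flattened vector, with no per-row function call.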

Or combined (it seems more useful to return the logical vector than the modified data.frame, which one could obtain with dd$Keep = f0(dd)):

f0 = function(dd) {
    v = strsplit(as.character(dd$variable), ",", fixed=TRUE)
    len = lengths(v)
    uv = unlist(v)
    idx = rep(seq_along(v), len)

    keep = logical(nrow(dd))
    keep[ idx[uv == as.character(dd$FO.variable)[idx]] ] = TRUE
    keep
}

(This could be made faster using the fact that the columns are factors, but maybe that's not intentional?) Compared with (the admittedly simpler and easier to understand)

f1 = function(dd) 
    mapply(grepl, dd$FO.variable, dd$variable, fixed=TRUE)

f1a = function(dd)
    mapply(grepl, as.character(dd$FO.variable), 
           as.character(dd$variable), fixed=TRUE)

f2 = function(dd)
    apply(dd, 1, function(x) grepl(x[4], x[2], fixed=TRUE))

with

> library(microbenchmark)
> identical(f0(df), f1(df))
[1] TRUE
> identical(f0(df), unname(f2(df)))
[1] TRUE
> microbenchmark(f0(df), f1(df), f1a(df), f2(df))
Unit: microseconds
    expr     min       lq      mean   median       uq     max neval
  f0(df)  57.559  64.6940  70.26804  69.4455  74.1035  98.322   100
  f1(df) 573.302 603.4635 625.32744 624.8670 637.1810 766.183   100
 f1a(df) 138.527 148.5280 156.47055 153.7455 160.3925 246.115   100
  f2(df) 494.447 518.7110 543.41201 539.1655 561.4490 677.704   100
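
One caveat worth checking on your own data: the results are identical on this sample, but the two approaches do not test quite the same thing. grepl looks for a substring, whereas the split-and-compare approach requires a whole comma-separated entry to be equal. The sketch below uses made-up values, and the row-wise %in% is only a simple stand-in for the exact comparison done in f0.

```r
# Invented values chosen so that the two notions of "inside" disagree.
dd <- data.frame(
  variable    = c("Organic Search,TV", "Paid Search,TV"),
  FO.variable = c("Organic Search", "Search"),
  stringsAsFactors = FALSE
)

# Substring test, as in f1/f1a: "Search" occurs inside "Paid Search,TV".
substring_match <- mapply(grepl, dd$FO.variable, dd$variable,
                          fixed = TRUE, USE.NAMES = FALSE)

# Whole-entry test: "Search" is not one of the entries "Paid Search", "TV".
parts <- strsplit(dd$variable, ",", fixed = TRUE)
token_match <- mapply(function(p, x) p %in% x, dd$FO.variable, parts,
                      USE.NAMES = FALSE)

substring_match
token_match
```

If one channel name can be a substring of another, the whole-entry comparison is probably the safer choice.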

Two subtle but important additions during the development of the timings were to use fixed=TRUE in the grepl calls, so the pattern is matched literally rather than as a regular expression, and to coerce the factors to character.

Here is a data.table approach that I think is very similar in spirit to Martin's:
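
Besides speed, fixed=TRUE can also change the answer when a value happens to contain regular-expression metacharacters. The strings below are hypothetical and only serve to illustrate the difference.

```r
x <- "TV (Cable),System Email"

# As a regex, the parentheses form a group, so the pattern means "TV Cable".
regex_paren <- grepl("TV (Cable)", x)                # FALSE
# With fixed = TRUE the pattern is taken literally.
fixed_paren <- grepl("TV (Cable)", x, fixed = TRUE)  # TRUE

# As a regex, "." matches any single character.
regex_dot <- grepl("A.B", "AxB")                     # TRUE
fixed_dot <- grepl("A.B", "AxB", fixed = TRUE)       # FALSE
```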

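library(data.table)

dt <- data.table(df)
# Add a character copy of FO.variable and a row number to join on
dt[,`:=`(
    fch = as.character(FO.variable),
    rn  = 1:.N
)]

dt[,keep:=FALSE]
# One row per (rn, entry) pair from the comma-separated 'variable' column
dtvars <- dt[,strsplit(as.character(variable),',',fixed=TRUE),by=rn]
setkey(dt,rn,fch)
# Rows of dt whose (rn, fch) pair also appears in dtvars are set to TRUE
dt[dtvars,keep:=TRUE]

# Drop the helper columns
dt[,c("fch","rn"):=NULL]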

The idea is to

  1. identify all pairs of rn & variable (saved in dtvars) and
  2. see which of these pairs match the rn & FO.variable pairs (in the original table, dt).
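
For readers without data.table, the same two steps can be sketched in base R using merge. This is an analogue of the idea rather than the code above, and the toy values and the names pairs, lookup and matched are made up for illustration.

```r
# Toy data with placeholder values.
dd <- data.frame(
  variable    = c("A,B", "B,C", "C"),
  FO.variable = c("A", "A", "C"),
  stringsAsFactors = FALSE
)

# Step 1: all (row number, entry) pairs from the split 'variable' column.
parts <- strsplit(dd$variable, ",", fixed = TRUE)
pairs <- data.frame(rn    = rep(seq_along(parts), lengths(parts)),
                    value = unlist(parts),
                    stringsAsFactors = FALSE)

# Step 2: the (row number, FO.variable) pairs, inner-joined against step 1.
lookup  <- data.frame(rn    = seq_len(nrow(dd)),
                      value = dd$FO.variable,
                      stringsAsFactors = FALSE)
matched <- merge(lookup, pairs, by = c("rn", "value"))

# A row is kept if its pair survived the join.
keep <- seq_len(nrow(dd)) %in% matched$rn
keep
```

The keyed data.table join updates the table in place, whereas this version builds an intermediate data frame, so I would not expect it to be as fast on the full 22k rows.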
