
How to select multiple patterns from a dataset in R

I have a dataset (data) containing a list of email IDs:

email=c("susgho.agency@gmail.com","suagencyter.m@gmail.com",
        "duff.abcnkhgt@gmail.com","ftyhabcdfg@gmail.com",
        "gjhfhg1-ail.com","gjhgkjhgbrt.gh@aol.com")

I also have another dataset (disp) with a list of patterns:

pattern=c(".agency",".abc","1-ail.com")

I want to check whether each pattern matches an email. The expected output is:

email                         pattern
susgho.agency@gmail.com       .agency
suagencyter.m@gmail.com 
duff.abcnkhgt@gmail.com       .abc
ftyhabcdfg@gmail.com    
gjhfhg1-ail.com               1-ail.com
gjhgkjhgbrt.gh@aol.com  

I am using a for loop, but it takes a very long time to run.

w <- NULL
for(i in 1:nrow(disp))
{
  w1 <- as.character(disp[i,1])
  w2 <- data[grep(w1, data$email),]
  if(nrow(w2) > 0)
  {
    w2$pattern <- w1
    w <- rbind(w, w2)
  }
  else
    w <- rbind(w, w2)
}

Any help would be greatly appreciated. TIA!

You can do this:

df$pattern[max.col(-attr(adist(df2$pattern,df$email,counts = T),'counts')[,,3])] = as.character(df2$pattern)
df
                    email   pattern
1 susgho.agency@gmail.com   .agency
2 suagencyter.m@gmail.com      <NA>
3 duff.abcnkhgt@gmail.com      .abc
4    ftyhabcdfg@gmail.com      <NA>
5         gjhfhg1-ail.com 1-ail.com
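
The one-liner above is dense, so here is a rough step-by-step sketch of what it appears to do. The intermediate names (`d`, `subs`, `best`) are mine, not the answer's, and the data frames are rebuilt with character columns for convenience. As I read `?adist`, the third slice of the `"counts"` attribute holds the substitution counts.

```r
# Sample data mirroring the answer's df and df2 (character columns assumed).
df  <- data.frame(email = c("susgho.agency@gmail.com", "suagencyter.m@gmail.com",
                            "duff.abcnkhgt@gmail.com", "ftyhabcdfg@gmail.com",
                            "gjhfhg1-ail.com"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(pattern = c(".agency", ".abc", "1-ail.com"),
                  stringsAsFactors = FALSE)

# Edit distances between every pattern (rows) and every email (columns),
# keeping the per-operation counts as an attribute.
d <- adist(df2$pattern, df$email, counts = TRUE)

# Third slice of the counts array: substitutions (patterns x emails).
subs <- attr(d, "counts")[, , 3]

# For each pattern, the index of the email needing the fewest substitutions.
# max.col() breaks ties at random by default.
best <- max.col(-subs)

# Start from an all-NA column so the assignment works for any index set.
df$pattern <- NA_character_
df$pattern[best] <- df2$pattern
df
```

Note that this assigns each pattern to exactly one email, the closest one by substitution count. A pattern therefore cannot label two emails, and a pattern with no true match is still assigned to some row, so treat it as a heuristic rather than an exact substring test.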

Or you can do:

merge(df,stack(setNames(Vectorize(grep)(df2$pattern,df,value=T,fixed=T),df2$pattern)),by.x="email",by.y = "values",all=T)
                    email       ind
1 duff.abcnkhgt@gmail.com      .abc
2    ftyhabcdfg@gmail.com      <NA>
3         gjhfhg1-ail.com 1-ail.com
4 suagencyter.m@gmail.com      <NA>
5 susgho.agency@gmail.com   .agency
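
For readers who find the nested call hard to follow, the same idea can be unpacked into three steps. This is only a sketch: it swaps `Vectorize(grep)` for `lapply`, uses `all.x = TRUE` instead of `all = TRUE`, and the names `hits`, `long`, and `result` are introduced here for illustration.

```r
# Sample data mirroring the answer's df and df2 (character columns assumed).
df  <- data.frame(email = c("susgho.agency@gmail.com", "suagencyter.m@gmail.com",
                            "duff.abcnkhgt@gmail.com", "ftyhabcdfg@gmail.com",
                            "gjhfhg1-ail.com"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(pattern = c(".agency", ".abc", "1-ail.com"),
                  stringsAsFactors = FALSE)

# Step 1: for each pattern, collect the emails that contain it literally
# (fixed = TRUE means "." is a plain dot, not a regex wildcard).
hits <- lapply(df2$pattern, grep, x = df$email, value = TRUE, fixed = TRUE)
names(hits) <- df2$pattern

# Step 2: flatten the named list into a data frame with columns
# `values` (the email) and `ind` (the pattern that matched it).
long <- stack(hits)

# Step 3: left-join onto the full email list so unmatched emails get NA.
result <- merge(df, long, by.x = "email", by.y = "values", all.x = TRUE)
names(result)[names(result) == "ind"] <- "pattern"
result
```

Because this is a join, an email that matches several patterns appears once per matching pattern, and `merge` returns the rows sorted by email rather than in their original order.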

Data:

df=read.table(text="email
           susgho.agency@gmail.com
           suagencyter.m@gmail.com
           duff.abcnkhgt@gmail.com
           ftyhabcdfg@gmail.com
           gjhfhg1-ail.com",h=T)

df2=read.table(text=" pattern
              .agency
              .abc
              1-ail.com",h=T)

A slightly different approach uses stringr::str_match, although you first need to escape the special characters in the pattern strings by prefixing them with a double backslash:

email=c("susgho.agency@gmail.com","suagencyter.m@gmail.com",
        "duff.abcnkhgt@gmail.com","ftyhabcdfg@gmail.com",
        "gjhfhg1-ail.com","gjhgkjhgbrt.gh@aol.com")

pattern=c("\\.agency","\\.abc","1-ail\\.com")

data.frame(email, pattern = stringr::str_match(email, paste(pattern, collapse = "|")))

This produces the following output:

                    email   pattern
1 susgho.agency@gmail.com   .agency
2 suagencyter.m@gmail.com      <NA>
3 duff.abcnkhgt@gmail.com      .abc
4    ftyhabcdfg@gmail.com      <NA>
5         gjhfhg1-ail.com 1-ail.com
6  gjhgkjhgbrt.gh@aol.com      <NA>
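
Escaping each pattern by hand gets tedious with a long pattern list. Below is a minimal base R sketch that escapes the regex metacharacters automatically and then extracts the first match per email. The helper `escape_regex` and the names `rx`, `m`, and `found` are hypothetical, and the code uses `regexpr`/`regmatches` in place of `stringr::str_match`, so it is a variation on the answer rather than the answer's own method.

```r
email <- c("susgho.agency@gmail.com", "suagencyter.m@gmail.com",
           "duff.abcnkhgt@gmail.com", "ftyhabcdfg@gmail.com",
           "gjhfhg1-ail.com", "gjhgkjhgbrt.gh@aol.com")

# Raw, unescaped patterns as given in the question.
pattern <- c(".agency", ".abc", "1-ail.com")

# Hypothetical helper: put a backslash in front of every regex metacharacter.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Combine the escaped patterns into one alternation.
rx <- paste(escape_regex(pattern), collapse = "|")

# Locate the first match in each email; regexpr() returns -1 when none exists.
m <- regexpr(rx, email, perl = TRUE)

# regmatches() returns only the matched substrings, so fill the hits into
# an all-NA vector to keep one entry per email.
found <- rep(NA_character_, length(email))
found[m > 0] <- regmatches(email, m)

data.frame(email, pattern = found)
```

If an email contains more than one of the patterns, only the leftmost match is reported.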

