
RStudio - using grepl() to grab specific characters and populate a new column in the data frame

I have a data set in RStudio (Aud) that looks like the following. ID and Function are both of type character.

ID                         Function
F04                        FZ000TTY WB002FR088DR011
F05                        FZ000AGH WZ004ABD
F06                        FZ0005ABD

My goal is to extract only the codes "FZ", "TTY", "WB", "FR", "WZ", "ABD" from every row in the data set and place each one in its own new column, so that I end up with something like the following:

ID     Function                  SUBFUN1  SUBFUN2  SUBFUN3  SUBFUN4 SUBFUN5
F04    FZ000TTY WB002FR088DR011  FZ       TTY      WB       FR      DR

I want to separate the functions because each one represents a certain behavior; that way I can plot, per ID, the behaviors or functions that occur most often over time.

I tried the following:

Aud$Subfun1<-
ifelse(grepl("FZ",Aud$Functions.NO.)==T,"FZ", "Other"))

Aud$Subfun2<-
ifelse(grepl("TTY",Aud$Functions.NO.)==T,"TTY","Other"))

I get the error messages below in my attempts for Subfun1 and Subfun2:

Error in `$<-.data.frame`(`*tmp*`, Subfun1, value = logical(0)) : 
  replacement has 0 rows, data has 343456

 Error in `$<-.data.frame`(`*tmp*`, Subfun2, value = logical(0)) : 
      replacement has 0 rows, data has 343456

I also tried substring(), but it requires a start and an end position for the character range to be captured in the new column. This is not ideal, as the codes FZ, TTY, WB, FR, WZ and ABD all appear at different positions in the function string.

Any help would be greatly appreciated.
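A note on the error itself: `value = logical(0)` suggests that `Aud$Functions.NO.` evaluated to NULL, which is what `$` returns when no column of that name exists. The sketch below reproduces this on the three sample rows; it assumes the real column is called `Function`, as in the sample shown above, and `missing_col` is a name introduced here only for illustration.

```r
# Sample data from the question (the column is named `Function`)
Aud <- data.frame(
  ID = c("F04", "F05", "F06"),
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
  stringsAsFactors = FALSE
)

# `$` with a name that matches no column returns NULL ...
missing_col <- Aud$Functions.NO.
is.null(missing_col)

# ... so grepl() and ifelse() both return zero-length results,
# and assigning that to a new column fails with "replacement has 0 rows"
flag <- ifelse(grepl("FZ", missing_col), "FZ", "Other")
length(flag)

# With the existing column name the assignment has one value per row
Aud$Subfun1 <- ifelse(grepl("FZ", Aud$Function), "FZ", "Other")
```

Note that `grepl()` already returns a logical vector, so the `== T` comparison in the original attempt is not needed.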

Using data.table:

library(data.table)
Aud <- data.frame(
  ID = c("F04", "F05", "F06"), 
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
  stringsAsFactors = FALSE
)
setDT(Aud)

cbind(Aud, Aud[, tstrsplit(Function, "[0-9]+| ")])
    ID                 Function V1  V2   V3   V4   V5
1: F04 FZ000TTY WB002FR088DR011 FZ TTY   WB   FR   DR
2: F05        FZ000AGH WZ004ABD FZ AGH   WZ  ABD <NA>
3: F06                FZ0005ABD FZ ABD <NA> <NA> <NA>

Staying in base R, one could do something like the following:

our_split <- strsplit(Aud$Function, "[0-9]+| ")

cbind(
  Aud,
  do.call(rbind, lapply(our_split, "length<-", max(lengths(our_split))))
)
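If the new columns should carry the SUBFUN1, SUBFUN2, ... names from the question, the base R approach can be extended as in the following sketch. It uses the same split pattern; the names `parts`, `width`, `padded` and `result` are introduced here for illustration only.

```r
Aud <- data.frame(
  ID = c("F04", "F05", "F06"),
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD"),
  stringsAsFactors = FALSE
)

# Split each string on runs of digits or on a single space
parts <- strsplit(Aud$Function, "[0-9]+| ")

# Pad every piece vector with NA up to the longest one, one row per ID
width <- max(lengths(parts))
padded <- t(vapply(parts, function(p) p[seq_len(width)], character(width)))

# Name the columns as in the desired output, then bind to the original data
colnames(padded) <- paste0("SUBFUN", seq_len(width))
result <- cbind(Aud, padded, stringsAsFactors = FALSE)
result
```

Indexing past the end of a character vector yields NA, which is what pads the shorter rows here.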

One can use tidyr::separate to split the Function column into multiple columns, using a regex as the separator.

library(tidyverse)

df %>% 
  separate(Function, into = paste("V",1:5, sep=""),  
           sep = "([^[:alpha:]]+)", fill="right", extra = "drop")

#    ID V1  V2   V3   V4   V5
# 1 F04 FZ TTY   WB   FR   DR
# 2 F05 FZ AGH   WZ  ABD <NA>
# 3 F06 FZ ABD <NA> <NA> <NA>

([^[:alpha:]]+) : separate on any run of characters other than letters.
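To see how this separator behaves next to the "[0-9]+| " pattern used in the other answers, here is a small base R sketch. The string "AB12 CD" is a made-up input in which digits are immediately followed by a space; that case does not occur in the sample data.

```r
x <- "FZ000TTY WB002FR088DR011"

# On the sample string both patterns give the same pieces
by_non_alpha <- strsplit(x, "[^[:alpha:]]+")[[1]]
by_digits_or_space <- strsplit(x, "[0-9]+| ")[[1]]

# Hypothetical input: digits directly followed by a space
y <- "AB12 CD"

# "[^[:alpha:]]+" treats "12 " as one separator
strsplit(y, "[^[:alpha:]]+")[[1]]

# "[0-9]+| " matches "12" and " " separately, leaving an empty piece between
strsplit(y, "[0-9]+| ")[[1]]
```

If the real data can contain such sequences, the non-letter pattern avoids the empty strings.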

Data:

df <- read.table(text=
"ID                         Function
F04                        'FZ000TTY WB002FR088DR011'
F05                        'FZ000AGH WZ004ABD'
F06                        FZ0005ABD",
header = TRUE, stringsAsFactors = FALSE)

A tidyverse way that uses stringr::str_extract_all to get a nested list of all occurrences of the search terms, then spreads it into the wide format you have as your desired output. If you were extracting any run of consecutive capital letters, you could use "[A-Z]+" as your search term, but since you said it was these specific IDs, you need a more specific search term. If writing out the regex becomes cumbersome, say because you have a vector of many of these IDs, you could paste them together, collapsing with |.

library(tidyverse)
# tibble() replaces data_frame(), which is deprecated in current tibble releases
Aud <- tibble(
  ID = c("F04", "F05", "F06"), 
  Function = c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")
)

search_terms <- "(FZ|TTY|WB|FR|WZ|ABD)"

Aud %>%
  mutate(code = str_extract_all(Function, search_terms)) %>%
  select(-Function) %>%
  unnest(code) %>%
  group_by(ID) %>%
  mutate(subfun = row_number()) %>%
  spread(key = subfun, value = code, sep = "")
#> # A tibble: 3 x 5
#> # Groups:   ID [3]
#>   ID    subfun1 subfun2 subfun3 subfun4
#>   <chr> <chr>   <chr>   <chr>   <chr>  
#> 1 F04   FZ      TTY     WB      FR     
#> 2 F05   FZ      WZ      ABD     <NA>   
#> 3 F06   FZ      ABD     <NA>    <NA>
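The answer mentions building the search term from a vector of IDs. A minimal sketch of that step follows; it uses base R's regmatches() and gregexpr() in place of str_extract_all() so that it runs without extra packages, and the names `ids` and `found` are introduced here for illustration.

```r
# The specific codes to look for
ids <- c("FZ", "TTY", "WB", "FR", "WZ", "ABD")

# Collapse the vector into one alternation pattern
search_terms <- paste0("(", paste(ids, collapse = "|"), ")")

x <- c("FZ000TTY WB002FR088DR011", "FZ000AGH WZ004ABD", "FZ0005ABD")

# One character vector of matches per input string
found <- regmatches(x, gregexpr(search_terms, x))
found
```

The resulting pattern is the same string as the hand-written `search_terms` above, so it can be passed to str_extract_all() unchanged.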

Created on 2018-07-11 by the reprex package (v0.2.0).
