
How to subset row values by string label in a single column in R?

I have a column whose row values I would like to subset based on the first and last 'string' label in R. The level values are as follows:

[1] "60022 (Location; 9TH FLOOR; Snacks)"
[3] "60024 (Location; 9TH FLOOR; Lg Snacks)"
[5] "60027 (Location; 9TH FLOOR; Sml Snacks)"

I would like to pull the number and the last string separated by the ';'. Is there a function or syntax in R to do this? That is, remove "Location; 9TH FLOOR" and keep only the string after the last ';'.

I have tried this to pull just the first value, but with this code I am unable to keep the "Snacks" part as well:

#updated_df_2020$Machine <- sub("([A-Za-z]+).*", "\\1", updated_df_2020$Machine) 

The end result for each row should be the number followed by the last label (60022 and then Snacks), like so:

[1] "60022 (Snacks)" 
[1] "60024 (Lg Snacks)" 
[1] "60027 (Sml Snacks)" 

If we need to remove the substring, capture the digits ( \\d+ ) at the start ( ^ ) of the string as the first group. Then, after the last ; and zero or more spaces ( \\s* ), capture a non-whitespace character ( \\S ) together with the characters that follow ( .* ) up to the ) at the end ( $ ) of the string as the second capture group. In the replacement, specify the backreferences ( \\1 , \\2 ) of the captured groups and add the ( before the second one.

updated_df_2020$Machine <- sub("^(\\d+)\\b.*;\\s*\\b(\\S.*\\))$", 
        "\\1 (\\2", updated_df_2020$Machine)
updated_df_2020$Machine
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"
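
To check what each capture group actually holds before doing the replacement, one option is to inspect the groups with base R's regexec() and regmatches(). This is only a sketch: the names x, pattern, groups, first_group, second_group and result are illustrative and not part of the original answer.

```r
# Illustrative input, copied from the question
x <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# Same pattern as in the sub() call above
pattern <- "^(\\d+)\\b.*;\\s*\\b(\\S.*\\))$"

# regexec() records the full match and each capture group;
# regmatches() returns one character vector per input string
groups <- regmatches(x, regexec(pattern, x))

# Element 1 is the whole match, element 2 the leading number,
# element 3 the last label including the closing parenthesis
first_group  <- sapply(groups, `[`, 2)
second_group <- sapply(groups, `[`, 3)

# Rebuild the string the same way the replacement "\\1 (\\2" does
result <- paste0(first_group, " (", second_group)
result
```

Seeing the groups separately makes it easier to adjust the pattern if the labels change shape.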

If the start of the string is not a digit and you still want to extract it, replace (\\d+) with (\\w+).
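
As a small illustration of the (\\w+) variant, here is a sketch on made-up input whose identifiers start with letters. The vector y and its values are hypothetical and only serve to show the behaviour.

```r
# Hypothetical input: identifiers that start with letters rather than digits
y <- c("AB123 (Location; 9TH FLOOR; Snacks)",
       "vend7 (Location; 2ND FLOOR; Lg Snacks)")

# (\\w+) accepts letters, digits and underscores at the start of the string
res_w <- sub("^(\\w+)\\b.*;\\s*\\b(\\S.*\\))$", "\\1 (\\2", y)
res_w
```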

Data

updated_df_2020 <- data.frame(Machine = c("60022 (Location; 9TH FLOOR; Snacks)",
   "60024 (Location; 9TH FLOOR; Lg Snacks)", "60027 (Location; 9TH FLOOR; Sml Snacks)"),
   stringsAsFactors = FALSE)

You could do

a <- c("60022 (Location; 9TH FLOOR; Snacks)", "60024 (Location; 9TH FLOOR; Snacks)", "60027 (Location; 9TH FLOOR; Snacks)")
# split each string on spaces, then keep the first and last pieces
strs <- strsplit(a, split = " ")
sapply(strs, function(s) paste(s[1], paste0("(", s[length(s)])))
#[1] "60022 (Snacks)" "60024 (Snacks)" "60027 (Snacks)"

This is uglier, but I guess a bit easier to understand.
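
Because the approach above splits on spaces, only the last word of a multi-word label survives, so "Lg Snacks" would come out as "(Snacks)". A possible variation, sketched here under the assumption that the wanted label is always the piece after the last ';', is to split on the semicolon instead. The names parts, id, label and res_split are illustrative.

```r
a <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# Split on each semicolon plus any whitespace that follows it
parts <- strsplit(a, split = ";\\s*")

res_split <- sapply(parts, function(p) {
  id    <- sub(" .*", "", p[1])  # text before the first space, e.g. "60022"
  label <- p[length(p)]          # last piece, e.g. "Lg Snacks)"
  paste0(id, " (", label)
})
res_split
```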

We can extract the number at the beginning and everything after the last semicolon using sub:

# \\s* consumes the space after the last ';' so it does not end up inside the parentheses
sub("(\\d+).*;\\s*(.*)", "\\1 (\\2", x)
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"

where x is

x <- c("60022 (Location; 9TH FLOOR; Snacks)", 
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")
