How to subset row values by string label in a single column in R?

Question

I have a column whose row values I would like to subset based on the first and last 'string' label in R. The level values are as follows:

[1] "60022 (Location; 9TH FLOOR; Snacks)"
[3] "60024 (Location; 9TH FLOOR; Lg Snacks)"
[5] "60027 (Location; 9TH FLOOR; Sml Snacks)"

I would like to pull out the number and the last string separated by the ';'. Is there a function or syntax in R to do this? That is, remove "Location; 9TH FLOOR" and keep only the last ';'-separated string.

I have tried this to pull just the first value, but I am unable to keep the "Snacks" part as well with this code:

#updated_df_2020$Machine <- sub("([A-Za-z]+).*", "\\1", updated_df_2020$Machine)

The end result for each row should be the number followed by the label (60022 and then Snacks), like so:

[1] "60022 (Snacks)" 
[1] "60024 (Lg Snacks)" 
[1] "60027 (Sml Snacks)"

Answer 1

To remove the substring, capture the digits ( \\d+ ) at the start ( ^ ) of the string as the first group. Then, as the second capture group, capture the first non-whitespace character ( \\S ) that follows the last ; and zero or more spaces ( \\s* ), together with the remaining characters ( .* ) up to the ) at the end ( $ ) of the string. In the replacement, refer to the captured groups by their backreferences ( \\1 , \\2 ) and add the ( before the second one.

updated_df_2020$Machine <- sub("^(\\d+)\\b.*;\\s*\\b(\\S.*\\))$", 
        "\\1 (\\2", updated_df_2020$Machine)
updated_df_2020$Machine
#[1] "60022 (Snacks)"     "60024 (Lg Snacks)"  "60027 (Sml Snacks)"

If the string does not start with digits and you still want to extract the leading token, replace (\\d+) with (\\w+).
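As a minimal sketch of that (\\w+) variant: the input below is a made-up example with an alphanumeric ID (it is not from the question), and the variable names `ids` and `cleaned` are only illustrative.

```r
# Hypothetical input whose leading token is alphanumeric rather than all digits
ids <- c("A60022 (Location; 9TH FLOOR; Snacks)",
         "B60024 (Location; 9TH FLOOR; Lg Snacks)")

# Same pattern as above, with (\\d+) swapped for (\\w+)
cleaned <- sub("^(\\w+)\\b.*;\\s*\\b(\\S.*\\))$", "\\1 (\\2", ids)
cleaned
```

The first group now matches any run of word characters (letters, digits, underscore), so the rest of the pattern and the replacement stay unchanged.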

data

updated_df_2020 <- data.frame(Machine = c("60022 (Location; 9TH FLOOR; Snacks)",
   "60024 (Location; 9TH FLOOR; Lg Snacks)", "60027 (Location; 9TH FLOOR; Sml Snacks)"),
   stringsAsFactors = FALSE)

Answer 2

You could do

> a <- c("60022 (Location; 9TH FLOOR; Snacks)", "60024 (Location; 9TH FLOOR; Snacks)", "60027 (Location; 9TH FLOOR; Snacks)")
> strs <- strsplit(a, split = " ")
> sapply(strs, function(s) paste(s[1], paste0("(", s[length(s)])))
#
# "60022 (Snacks)" "60024 (Snacks)" "60027 (Snacks)"
#

which is uglier, but I guess a bit easier to understand.
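Note that splitting on a space keeps only the last word, so a label such as "Lg Snacks" would come out as "Snacks". One possible variation, sketched here with the three values from the question, is to split on the semicolon instead. The names `ids`, `parts`, `labels` and `result` are illustrative.

```r
a <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# Leading number: drop everything from the first space onward
ids <- sub(" .*", "", a)

# Last ';'-separated piece of each string, with surrounding whitespace trimmed
parts <- strsplit(a, split = ";", fixed = TRUE)
labels <- trimws(sapply(parts, function(p) p[length(p)]))

# The label still carries the closing ")", so only "(" has to be added
result <- paste0(ids, " (", labels)
result
```

This keeps the split-and-paste approach of this answer while preserving multi-word labels.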

Answer 3

We can extract the number at the beginning and everything after the last semicolon using sub:

sub("(\\d+).*;(.*)", "\\1 (\\2", x)
#[1] "60022 ( Snacks)"     "60024 ( Lg Snacks)"  "60027 ( Sml Snacks)"

where x is

x <- c("60022 (Location; 9TH FLOOR; Snacks)", 
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")
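The output above keeps the space that follows the last semicolon, so it reads "( Snacks)" rather than the "(Snacks)" asked for in the question. A small variation, offered as a sketch rather than as part of the original answer, consumes that whitespace with \\s* before the second group. The name `trimmed` is illustrative.

```r
x <- c("60022 (Location; 9TH FLOOR; Snacks)",
       "60024 (Location; 9TH FLOOR; Lg Snacks)",
       "60027 (Location; 9TH FLOOR; Sml Snacks)")

# \\s* swallows the space after the last ';' so it is not captured in group 2
trimmed <- sub("(\\d+).*;\\s*(.*)", "\\1 (\\2", x)
trimmed
```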

3 answers

Answer 1: score 1, accepted, 2020-03-03 21:06:51

Answer 2: score 1, 2020-03-03 21:19:29

Answer 3: score 0, 2020-03-04 00:44:14
