Extracting a string from one column into another in R
I have an example data frame like the one below.

ID | File
---|---
1 | 11_213.csv
2 | 13_256.csv
3 | 11_223.csv
4 | 12_389.csv
5 | 14_456.csv
6 | 12_345.csv
I want to add another column based on the string between the underscore and the period, to get a data frame that looks something like this.

ID | File | Group
---|---|---
1 | 11_213.csv | 213
2 | 13_256.csv | 256
3 | 11_223.csv | 223
4 | 12_389.csv | 389
5 | 14_456.csv | 456
6 | 12_345.csv | 345
I think I need to use the `str_extract` function from stringr, but I am not sure what notation to use for my pattern. For example, when I use:

df <- df %>%
  mutate("Group" = str_extract(File, "[^_]+"))
I get all the information before the underscore, like this:

ID | File | Group
---|---|---
1 | 11_213.csv | 11
2 | 13_256.csv | 13
3 | 11_223.csv | 11
4 | 12_389.csv | 12
5 | 14_456.csv | 14
6 | 12_345.csv | 12
But that is not what I want. What should I use instead of `"[^_]+"` to get just the part between the underscore and the period? Thanks!
We can use regex lookarounds with `str_extract` to extract the digits (`\\d+`) that follow a `_` and precede a `.`:
library(dplyr)
library(stringr)

df <- df %>%
  mutate(Group = str_extract(File, "(?<=_)(\\d+)(?=\\.)"))
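As a self-contained sketch of the lookaround approach, the snippet below rebuilds the example data by hand (the `data.frame()` call is a reconstruction of the table in the question, not code from the original post) and applies the same pattern without the pipe:

```r
library(stringr)

# Hand-built copy of the example data from the question (an assumption,
# the original post does not show how df was created)
df <- data.frame(
  ID = 1:6,
  File = c("11_213.csv", "13_256.csv", "11_223.csv",
           "12_389.csv", "14_456.csv", "12_345.csv")
)

# (?<=_) requires an underscore just before the match,
# (?=\\.) requires a literal period just after it;
# neither is included in the extracted text
df$Group <- str_extract(df$File, "(?<=_)\\d+(?=\\.)")

print(df$Group)
# "213" "256" "223" "389" "456" "345"
```

The result is a character column; wrapping the call in `as.integer()` should give numbers if the group is needed numerically.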
Another option is to remove the unwanted substrings with `str_remove_all`, i.e. match all characters up to and including the `_` (`.*_`), or (`|`) everything from the `.` onwards (`\\..*`). A `.` matches any character in regex mode, which is the default, so we escape it (`\\.`) to match it literally.
df <- df %>%
  mutate(Group = str_remove_all(File, ".*_|\\..*"))
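The `_all` variant matters here because the pattern has two alternatives that match at different places in the string. A small sketch on a single file name (the variable names are illustrative only) shows the difference between removing the first match and removing every match:

```r
library(stringr)

file <- "11_213.csv"

# str_remove() drops only the first match, which is the "11_" prefix
first_only <- str_remove(file, ".*_|\\..*")

# str_remove_all() drops every match: the "11_" prefix and the ".csv" suffix
all_matches <- str_remove_all(file, ".*_|\\..*")

print(first_only)   # "213.csv"
print(all_matches)  # "213"
```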
A base R option using `gsub`:
transform(
df,
Group = gsub(".*_(\\d+)\\..*", "\\1", File)
)
gives
ID File Group
1 1 11_213.csv 213
2 2 13_256.csv 256
3 3 11_223.csv 223
4 4 12_389.csv 389
5 5 14_456.csv 456
6 6 12_345.csv 345
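One thing to keep in mind with the substitution approach is that `sub`/`gsub` return the input unchanged when the pattern does not match, whereas `str_extract` returns `NA`. The sketch below uses a made-up non-matching file name (`readme.txt` is hypothetical, not from the question) and shows one possible way to guard against that in base R:

```r
# Second element is a hypothetical file name with no "_<digits>." part
files <- c("11_213.csv", "readme.txt")
pattern <- ".*_(\\d+)\\..*"

# Without a guard, the non-matching name passes through untouched
naive <- sub(pattern, "\\1", files)

# With a guard, non-matching names become NA instead
guarded <- ifelse(grepl(pattern, files),
                  sub(pattern, "\\1", files),
                  NA_character_)

print(naive)    # "213" "readme.txt"
print(guarded)  # "213" NA
```

Whether the guard is worth adding depends on how uniform the file names are; for the data in the question every row matches, so both versions give the same result.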