Split strings into two columns
I am working with someone else's data, which has a column whose possible values are "short" and "long". Unfortunately, the data creator also appended letters and question marks to those words to annotate certain things, and I want to split those annotations into a separate column. Here's some fake data to work with:
vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')
testdf <- cbind(vars, species)
I would like to split the vars column into the actual value (long or short) and a new column with just the annotation characters. I can get halfway there with the following, which correctly produces a new column with just the annotation characters:
testdf %>% mutate(notes = gsub('long|short', "", vars))
But I can't figure out how to split or subset vars such that I get a column that just says short or long.
Thanks in advance for the help, SO community! ^_^
It's difficult to extract pieces of a string in base R. Use stringr instead:
library(stringr)
str_extract(vars, 'long|short')
# [1] "long" "short" "long" "short" "short" "long"
(You can use it in mutate or however else.)
So your complete example (I would parametrize the pattern for good measure):
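library(dplyr)
library(stringr)

pattern <- "long|short"
# testdf from the question is a matrix, so convert it to a data frame first
mutate(as.data.frame(testdf),
       notes  = gsub(pattern, "", vars),
       notes2 = str_replace(vars, pattern, ""),  # stringr alternative for consistent syntax
       ls     = str_extract(vars, pattern))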
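For comparison, the same extraction can be sketched in base R with regexpr() and regmatches(). This is only an illustration on the question's vars vector, not part of the answer above. One caveat: regmatches() silently drops elements that do not match, whereas str_extract() returns NA for them, so the lengths line up here only because every sample value contains long or short.

```r
vars <- c('long','short','longG','short?','short?F','long?G')

# regexpr() locates the first match in each element;
# regmatches() pulls out the matched substrings
value <- regmatches(vars, regexpr('long|short', vars))

# removing the match leaves only the annotation characters
notes <- sub('long|short', '', vars)

data.frame(vars = value, notes = notes)
```

If some rows might contain neither word, the stringr version is likely the more convenient choice, since it keeps one result per input row.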
testdf in the question is a matrix, so convert it to a data frame first and then use one of these two alternatives:
1) sub: Use a mutate with two sub invocations that share the same pattern pat but use different replacements.
library(dplyr)

# group 1 captures long/short, group 2 captures whatever follows
pat <- "(long|short)(.*)"
testdf %>%
  as.data.frame %>%
  mutate(notes = sub(pat, "\\2", vars),
         vars = sub(pat, "\\1", vars))
giving:
   vars species notes
1  long     sp1
2 short     sp2
3  long     sp3     G
4 short     sp4     ?
5 short     sp5    ?F
6  long     sp6    ?G
2) separate: Insert a semicolon (or other character) after long or short and then use separate from tidyr. Note that this works even if the notes contain a semicolon, since extra = "merge" splits only at the first semicolon.
library(dplyr)
library(tidyr)

testdf %>%
  as.data.frame %>%
  mutate(vars = sub("(long|short)", "\\1;", vars)) %>%
  separate(vars, c("vars", "notes"), sep = ";", extra = "merge")
giving:
   vars notes species
1  long           sp1
2 short           sp2
3  long     G     sp3
4 short     ?     sp4
5 short    ?F     sp5
6  long    ?G     sp6
Note that if there is always a ? separating the value from the notes, this could be reduced to the following (the ? itself is then consumed as the separator and no longer appears in notes):
testdf %>%
as.data.frame %>%
separate(vars, c("vars", "notes"), sep = "\\?", extra = "merge")
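Whether that shortcut applies can be checked before relying on it. The following is a small base R sketch on the question's sample vector (an assumption: the real data may look different). It shows that not every sample value contains a ?, so a value such as longG, which carries a note but no ?, would be left unsplit by sep = "\\?".

```r
vars <- c('long','short','longG','short?','short?F','long?G')

# fixed = TRUE treats "?" as a literal character rather than a regex quantifier
has_q <- grepl("?", vars, fixed = TRUE)

# TRUE only if every value carries a "?" separator
all(has_q)

# values that sep = "\\?" would not split
vars[!has_q]
```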