
Split strings into two columns

I am working with someone else's data with a column for which the possible values are "short" and "long". Unfortunately, the data creator also added letters and question marks after those words to annotate certain things, and I want to split those annotations into a separate column. Here's some fake data to work with:

vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')
testdf <- cbind(vars, species)

I would like to split the vars column into the actual value, long or short, and a new column with just the annotation characters. I can get halfway there with the following, which correctly produces a new column with just the annotating characters:

testdf %>% mutate(notes = gsub('long|short', "", vars))

But I can't figure out how to split or subset vars such that I get a column that just says short or long.

Thanks in advance for the help, SO community! ^_^

It's difficult to extract pieces of a string in base R. Using stringr instead:

library(stringr)
str_extract(vars, 'long|short')
# [1] "long"  "short" "long"  "short" "short" "long" 

(You can use it inside mutate or anywhere else.)
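
For comparison, here is a minimal base R sketch of the same extraction. It is not part of the original answer; it assumes every element of vars contains either "long" or "short", because regmatches silently drops elements that have no match.

```r
vars <- c('long','short','longG','short?','short?F','long?G')

# regexpr() locates the first match of the pattern in each string;
# regmatches() then pulls the matched substrings out.
# Caveat: strings without a match are dropped from the result, so the
# output is only aligned with vars when every element matches.
m <- regexpr('long|short', vars)
value <- regmatches(vars, m)
value
# [1] "long"  "short" "long"  "short" "short" "long"
```

The extra bookkeeping with match positions is the main reason str_extract tends to be more convenient here.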


So your complete example (I would parametrize the pattern for good measure):

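library(dplyr)
library(stringr)

pattern <- "long|short"
# testdf from the question is a matrix, so convert it before calling mutate
mutate(as.data.frame(testdf),
   notes = gsub(pattern, "", vars),
   notes2 = str_replace(vars, pattern, ""), # stringr alternative for consistent syntax
   ls = str_extract(vars, pattern))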

testdf in the question is a matrix, so first convert it to a data frame and then use one of these two alternatives:

1) sub: Use mutate with two sub invocations that share the same pattern pat but use different replacements.

library(dplyr)

pat <- "(long|short)(.*)"
testdf %>% 
       as.data.frame %>%
       mutate(notes = sub(pat, "\\2", vars), 
              vars = sub(pat, "\\1", vars))

giving:

   vars  species notes
1  long      sp1      
2 short      sp2      
3  long      sp3     G
4 short      sp4     ?
5 short      sp5    ?F
6  long      sp6    ?G
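
The matrix issue noted above can also be avoided at the source by building testdf with data.frame() rather than cbind(). The following is a minimal base R sketch of option 1 under that assumption (no dplyr required); stringsAsFactors = FALSE is set explicitly so the columns stay character vectors on older R versions.

```r
vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')

# data.frame() keeps each vector as its own column, unlike cbind(),
# which returns a character matrix
testdf <- data.frame(vars, species, stringsAsFactors = FALSE)

pat <- "(long|short)(.*)"
# compute notes first, because the next line overwrites vars
testdf$notes <- sub(pat, "\\2", testdf$vars)  # second group: the annotation
testdf$vars  <- sub(pat, "\\1", testdf$vars)  # first group: long or short
testdf
```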

2) separate: Insert a semicolon (or other character) after long or short and then use separate from tidyr. Note that this works even if the notes contain a semicolon, since extra = "merge" makes it split only at the first semicolon.

library(dplyr)
library(tidyr)

testdf %>% 
       as.data.frame %>%
       mutate(vars = sub("(long|short)", "\\1;", vars)) %>%
       separate(vars, c("vars", "notes"), sep = ";", extra = "merge")

giving:

   vars notes  species
1  long            sp1
2 short            sp2
3  long     G      sp3
4 short     ?      sp4
5 short    ?F      sp5
6  long    ?G      sp6

Note that if there is always a ? separating the notes, then it could be reduced to the following (the ? itself is consumed as the separator and does not appear in notes):

testdf %>% 
       as.data.frame %>%
       separate(vars, c("vars", "notes"), sep = "\\?", extra = "merge")
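
As a further variation that is not taken from the answers above, tidyr also provides extract, which splits a column by regex capture groups and so skips the step of inserting a separator. This is only a sketch: it assumes tidyr is installed and reuses the two-group pattern from option 1. The function is called as tidyr::extract to avoid confusion with magrittr's function of the same name.

```r
library(tidyr)

vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')
testdf <- data.frame(vars, species, stringsAsFactors = FALSE)

# each capture group in the regex becomes one of the columns named in `into`;
# the original vars column is replaced by the two new columns
out <- tidyr::extract(testdf, vars,
                      into = c("vars", "notes"),
                      regex = "(long|short)(.*)")
out
```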
