Split a string into two columns

Question

I am working with someone else's data, which has a column whose possible values are "short" and "long". Unfortunately, the data creator also added letters and question marks after those words to annotate certain things, and I want to split those annotations into a separate column. Here's some fake data to work with:

vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')
testdf <- cbind(vars, species)

I would like to split the vars column into the actual value, long or short, and a new column with just the annotating characters. I can get halfway there with the following, which correctly produces a new column with just the annotating characters:

testdf %>% mutate(notes = gsub('long|short', "", vars))

But I can't figure out how to split or subset vars so that I get a column that just says short or long.

Thanks in advance for the help, SO community! ^_^

Answer 1

It's difficult to extract pieces of a string in base R. Using stringr instead:

library(stringr)
str_extract(vars, 'long|short')
# [1] "long"  "short" "long"  "short" "short" "long"

(You can use it inside mutate or anywhere else.)
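For comparison, here is a minimal sketch of the base R route, which goes through regexpr() and regmatches(). It reuses the vars vector from the question; the names m and extracted are only illustrative.

```r
vars <- c('long','short','longG','short?','short?F','long?G')

# regexpr() locates the first match in each element;
# regmatches() then pulls out the matched text.
m <- regexpr('long|short', vars)
extracted <- regmatches(vars, m)
extracted
# [1] "long"  "short" "long"  "short" "short" "long"
```

One difference to keep in mind: str_extract returns NA for an element with no match, whereas regmatches used this way drops that element, so the result can be shorter than the input.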

So your complete example (I would parametrize the pattern for good measure):

library(dplyr)
pattern <- "long|short"
mutate(as.data.frame(testdf),  # testdf is a matrix, so convert it first
   notes = gsub(pattern, "", vars),
   notes2 = str_replace(vars, pattern, ""), # stringr alternative for consistent syntax
   ls = str_extract(vars, pattern))

Answer 2

testdf in the question is a matrix, so convert it to a data frame first and then use one of these two alternatives:

1) sub: Use mutate with two sub invocations that share the same pattern pat but use different replacements.

library(dplyr)

pat <- "(long|short)(.*)"
testdf %>% 
       as.data.frame %>%
       mutate(notes = sub(pat, "\\2", vars),  # second capture group: the annotation
              vars = sub(pat, "\\1", vars))   # first capture group: long or short

giving:

   vars  species notes
1  long      sp1      
2 short      sp2      
3  long      sp3     G
4 short      sp4     ?
5 short      sp5    ?F
6  long      sp6    ?G
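The same capture-group idea also works without dplyr. The following is a minimal sketch in base R only, assuming the fake data from the question; df and result are illustrative names, and transform() stands in for mutate(). It also checks the point that cbind() produces a matrix here.

```r
vars <- c('long','short','longG','short?','short?F','long?G')
species <- c('sp1','sp2','sp3','sp4','sp5','sp6')

testdf <- cbind(vars, species)
is.matrix(testdf)  # TRUE: cbind() on two character vectors gives a character matrix

# Convert to a data frame before working with columns by name.
df <- as.data.frame(testdf, stringsAsFactors = FALSE)

pat <- "(long|short)(.*)"
# transform() evaluates both expressions against the original df,
# so notes is still computed from the unmodified vars column.
result <- transform(df,
                    notes = sub(pat, "\\2", vars),
                    vars  = sub(pat, "\\1", vars))
result
```

Building the data with data.frame(vars, species) in the first place would avoid the conversion step altogether.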

2) separate: Insert a semicolon (or other character) after long or short and then use separate from tidyr. Note that this works even if the notes contain a semicolon, since extra = "merge" makes it split only at the first semicolon.

library(tidyr)

testdf %>% 
       as.data.frame %>%
       mutate(vars = sub("(long|short)", "\\1;", vars)) %>%
       separate(vars, c("vars", "notes"), sep = ";", extra = "merge")

giving:

   vars notes  species
1  long            sp1
2 short            sp2
3  long     G      sp3
4 short     ?      sp4
5 short    ?F      sp5
6  long    ?G      sp6
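To illustrate the point about notes that themselves contain a semicolon, here is a small sketch. The second value, 'shorta;b', is hypothetical and does not come from the question's data; demo and out are illustrative names.

```r
library(tidyr)

# Hypothetical input: the second annotation ("a;b") contains a semicolon.
demo <- data.frame(vars = c("longG", "shorta;b"), stringsAsFactors = FALSE)

# Insert a semicolon right after long/short; sub() only touches the first match.
demo$vars <- sub("(long|short)", "\\1;", demo$vars)

# With two target columns and extra = "merge", everything after the
# first semicolon stays together in notes.
out <- separate(demo, vars, c("vars", "notes"), sep = ";", extra = "merge")
out
```

Here the second row ends up with vars equal to short and notes equal to a;b, so the semicolon inside the annotation survives.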

Note that if a ? always separates the value from the notes, this could be reduced to the following (the ? itself is then dropped from notes):

testdf %>% 
       as.data.frame %>%
       separate(vars, c("vars", "notes"), sep = "\\?", extra = "merge")

Answer 1: 2, accepted, 2017-12-04 17:26:56
Answer 2: 2, 2017-12-04 17:38:44