
How to extract all words after the nth word from string in R?

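The first column of my data.frame consists of strings, and the second column is a unique key.

I want to extract all words after the nth word from each string, and if a string has <= n words, extract the entire string.

My data.frame has more than 10k rows. Is there a fast way to do this other than a for loop?

Thanks.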

How about:

# Generate some sample data
library(tidyverse)
df <- data.frame(
    one = c("Entries from row one", "Entries from row two", "Entries from row three"),
    two = runif(3))


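# Define a function to extract all words after the first n words
# (or return the full string if it has <= n words)
crop_string <- function(ss, n) {
    # Split on runs of whitespace, then rebuild each string from word n + 1 on;
    # vapply returns a plain character vector rather than a list
    vapply(strsplit(as.character(ss), "\\s+"), function(v) {
        if (length(v) > n) paste(v[(n + 1):length(v)], collapse = " ")
        else paste(v, collapse = " ")
    }, character(1))
}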

# Let's crop strings from column one by removing the first 3 words (n = 3)
n <- 3
df %>%
    mutate(words_after_n = crop_string(one, n))
#                     one       two words_after_n
#1   Entries from row one 0.5120053           one
#2   Entries from row two 0.1873522           two
#3 Entries from row three 0.0725107         three


# If n > # of words, return the full string
n <- 10
df %>%
    mutate(words_after_n = crop_string(one, n))
#                     one       two          words_after_n
#1   Entries from row one 0.9363278   Entries from row one
#2   Entries from row two 0.3024628   Entries from row two
#3 Entries from row three 0.6666226 Entries from row three
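
As an alternative that avoids splitting and re-pasting every string, the same cropping can be sketched with a single vectorised regular expression. This is only a sketch: the helper name drop_first_n_words is made up for illustration, and it assumes words are separated by whitespace and that strings carry no trailing whitespace (a string of exactly n words followed by a space would come back empty).

```r
# Hypothetical helper: remove the first n whitespace-separated words.
# If a string has <= n words, the pattern does not match and the
# string is returned unchanged.
drop_first_n_words <- function(x, n) {
    # Optional leading whitespace, then n repetitions of "word + whitespace"
    pattern <- sprintf("^\\s*(?:\\S+\\s+){%d}", as.integer(n))
    sub(pattern, "", as.character(x), perl = TRUE)
}

strings <- c("Entries from row one", "Entries from row two", "Entries from row three")
drop_first_n_words(strings, 3)   # crops the first three words
drop_first_n_words(strings, 10)  # too few words, strings come back unchanged
```

Because sub() is vectorised, the helper can be applied to a whole column in one call, for example df$words_after_n <- drop_first_n_words(df$one, n).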

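Here I use nchar(), so make sure your data has been converted to character first. Note that this crops by character position rather than by word.

# Convert to character, then keep everything from position y onward
# (strings shorter than y characters are returned unchanged)
YOUR_DATA <- as.character(YOUR_DATA)
as.character(sapply(YOUR_DATA, function(x, y) {
    if (nchar(x) >= y) {
        substr(x, y, nchar(x))
    } else {
        x
    }
}, y = nth_data_you_want))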

The data look like this:
"gene@seq"
"Cblb@TAGTCCCGAAGGCATCCCGA"
"Fbxo27@CCCACGTGTTCTCCGGCATC"
"Fbxo11@GGAATATACGTCCACGAGAA"
"Pwp1@GCCCGACCCAGGCACCGCCT"

I used 10 as n, and the result is:

"gene@seq"
"CCCGAAGGCATCCCGA"
"CACGTGTTCTCCGGCATC"
"AATATACGTCCACGAGAA"
"GACCCAGGCACCGCCT"
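
The character-position approach above can also be written without sapply, since nchar() and substr() are already vectorised. The following is a minimal sketch; the function name crop_from_position and the variable guides are hypothetical, and the sample strings are assumed to contain no spaces around the @ sign.

```r
# Hypothetical helper: keep everything from character position y onward,
# returning strings shorter than y characters unchanged.
crop_from_position <- function(x, y) {
    x <- as.character(x)
    ifelse(nchar(x) >= y, substr(x, y, nchar(x)), x)
}

guides <- c("gene@seq",
            "Cblb@TAGTCCCGAAGGCATCCCGA",
            "Fbxo27@CCCACGTGTTCTCCGGCATC")
crop_from_position(guides, 10)
# "gene@seq" "CCCGAAGGCATCCCGA" "CACGTGTTCTCCGGCATC"
```

Keep in mind that this counts characters, not words, so it only answers the original question when the part to drop has a known fixed width.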

