Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector, and every element of that vector is the full text of a document. I want to truncate the words in each element so that the maximum word length is 5 characters.

For example:

a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)

head(df)
  file                                                      text
1    1     Words longer than five characters should be truncated
2    2 Words shorter than five characters should not be modified

And this is what I'm trying to get:

  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

I've tried using strsplit() and strtrim() to modify each word (based in part on the question "split vectors of words by every n words (vectors are in a list)"):

x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than"  "five"  "chara" "shoul" "be"    "trunc" "Words" "short" "than" 
[12] "five"  "chara" "shoul" "not"   "be"    "modif"

But I don't know if that's the right direction, because I ultimately need the words back in a data frame, associated with the correct row, as shown above.

Is there a way to do this using gsub and regex?

If you want to use gsub for this task:

> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
#   file                                           text
# 1    1     Words longe than five chara shoul be trunc
# 2    2 Words short than five chara shoul not be modif
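
For reference, here is a minimal sketch that breaks the pattern into its parts and runs it on a small vector. This is one reading of the regex, not an explanation given in the answer, and the names `pattern`, `x`, and `out` are only for illustration.

```r
# (?=\\b\\pL{6,})  lookahead: a word boundary followed by 6 or more letters
# .{5}             consume the first 5 characters of that word
# \\K              drop everything matched so far from the reported match
# \\pL*            the remaining letters, the only part that gets replaced
pattern <- "(?=\\b\\pL{6,}).{5}\\K\\pL*"

x <- c("Words longer than five characters", "be truncated")
out <- gsub(pattern, "", x, perl = TRUE)
out
# [1] "Words longe than five chara" "be trunc"
```

Since \pL matches letters only, tokens that contain digits or punctuation may be handled differently from purely alphabetic words.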

You were on the right track. For your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was verbose on purpose to make it clear, but you can obviously use fewer lines.

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})

And the output:

> df
  file                                           text
1    1     Words longe than five chara shoul be trunc
2    2 Words short than five chara shoul not be modif

The short version is:

df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")  
})
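
A possible variation, sketched here as an assumption rather than as part of the answer: strsplit() is already vectorized over the column, so the per-row work can be wrapped in a small helper. The name `truncate_words` and its `n` argument are hypothetical. vapply() is used instead of sapply() so the result is always an unnamed character vector with one string per row.

```r
# Hypothetical helper, not from the original answer.
truncate_words <- function(x, n = 5) {
  vapply(
    strsplit(x, " ", fixed = TRUE),            # one vector of words per row
    function(words) paste(strtrim(words, n), collapse = " "),
    character(1)                               # exactly one string per row
  )
}

df <- data.frame(
  file = c(1, 2),
  text = c("Words longer than five characters should be truncated",
           "Words shorter than five characters should not be modified"),
  stringsAsFactors = FALSE
)
df$text <- truncate_words(df$text)
df$text
# [1] "Words longe than five chara shoul be trunc"
# [2] "Words short than five chara shoul not be modif"
```

Keep in mind that strtrim() trims to display width. For plain ASCII text like this example, that is the same as trimming to a number of characters.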

Edit:

I just realized you asked whether this can be done using gsub and regex. You don't need them for this, but it's possible, just harder to read:

df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})

The regex matches everything that follows the first 5 characters of each word and replaces it with nothing. perl = TRUE is necessary to enable the regex lookbehind, (?<=.{5}).
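
To illustrate why the split step matters with this pattern, here is a small sketch (the variable names `words`, `per_word`, and `whole` are only for illustration). Applied to individual words, the lookbehind keeps the first 5 characters. Applied to an unsplit sentence, the same pattern removes everything after the first 5 characters of the whole string.

```r
words <- c("Words", "longer", "be", "characters")

# Per word: a match can only start once 5 characters precede it,
# so words of 5 characters or fewer are left alone.
per_word <- gsub("(?<=.{5}).+", "", words, perl = TRUE)
per_word
# [1] "Words" "longe" "be"    "chara"

# Whole sentence: .+ also matches spaces, so the rest of the text is dropped.
whole <- gsub("(?<=.{5}).+", "", "Words longer than five", perl = TRUE)
whole
# [1] "Words"
```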
