Extract words from a string with no spaces or delimiters using R

Question

babybag - baby bag
badshelter - bad shelter
themoderncornerstore - the modern corner store
hamptonfamilyguidebook - hampton family guide book

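Is there a way to use R to extract words from a string that has no spaces or other delimiters? I have a list of URLs and I'm trying to figure out which words they contain.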

input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

Answer 1

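Here is a naive approach that may give you some ideas. I used the hunspell library, but you could test the substrings against any dictionary.

I start from the right, try every substring ending there, and keep the longest one that can be found in the dictionary, then move my starting position. It is quite slow, so I hope you don't have 4 million of them. hampton is not in this dictionary, so the result for the last string is incorrect: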

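split_words <- function(x){
  candidate <- x
  words <- NULL
  j <- nchar(x)                   # right edge of the part still to be split
  while(j !=0){
    word <- NULL
    # try every substring ending at j, shortest first, so the longest valid one wins
    for (i in j:1){
      candidate <- substr(x,i,j)
      # hunspell_find() returns the misspelled words; none means the candidate is in the dictionary
      if(!length(hunspell::hunspell_find(candidate)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")  # no dictionary word ends at j: give up
    words <- c(word,words)        # prepend, since we work right to left
    j <- j-nchar(word)
  }
  words
}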


input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

lapply(input,split_words)
# [[1]]
# [1] "baby" "bag" 
# 
# [[2]]
# [1] "bad"     "shelter"
# 
# [[3]]
# [1] "the"    "modern" "corner" "store" 
# 
# [[4]]
# [1] "h"         "amp"       "ton"       "family"    "guidebook"
#
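As a side note on the remark that any dictionary would do: below is a minimal sketch of the same greedy right-to-left idea run against a plain character vector, with no hunspell dependency. The names `toy_dict` and `split_words_dict` are made up for illustration, and the tiny word list is an assumption standing in for a real one.

```r
# Hypothetical toy dictionary; in practice this would be a full word list.
toy_dict <- c("baby", "bag", "bad", "shelter", "the", "modern", "corner", "store")

# Greedy right-to-left split: at each step keep the longest suffix found in `dict`.
split_words_dict <- function(x, dict) {
  words <- character(0)
  j <- nchar(x)
  while (j > 0) {
    # All substrings ending at position j, longest first (start = 1, 2, ..., j).
    candidates <- substring(x, seq_len(j), j)
    hits <- which(candidates %in% dict)
    if (length(hits) == 0) return(character(0))  # no dictionary word ends here
    word <- candidates[hits[1]]                  # smallest start = longest match
    words <- c(word, words)                      # prepend: we move right to left
    j <- j - nchar(word)
  }
  words
}

split_words_dict("themoderncornerstore", toy_dict)
# [1] "the"    "modern" "corner" "store"
```

Checking all candidates at once with `%in%` avoids one dictionary call per substring, which may help with the speed problem, though the approach is still greedy and can fail on strings where only a shorter match leads to a full split.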

Here is a quick fix that lets you add words such as hampton to the dictionary manually:

split_words <- function(x, additional = c("hampton","otherwordstoadd")){
  candidate <- x
  words <- NULL
  j <- nchar(x)
  while(j !=0){
    word <- NULL
    for (i in j:1){
      candidate <- substr(x,i,j)
      if(!length(hunspell::hunspell_find(candidate,ignore = additional)[[1]])) word <- candidate
    }
    if(is.null(word)) return("")
    words <- c(word,words)
    j <- j-nchar(word)
  }
  words
}


input <- c("babybag", "badshelter", "themoderncornerstore", "hamptonfamilyguidebook")

lapply(input,split_words)
# [[1]]
# [1] "baby" "bag" 
# 
# [[2]]
# [1] "bad"     "shelter"
# 
# [[3]]
# [1] "the"    "modern" "corner" "store" 
# 
# [[4]]
# [1] "hampton"   "family"    "guidebook"
#

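You will just have to hope that you don't run into any ambiguous expressions, though. Note that "guidebook" is a single word in my output, so there is already a corner case among your four examples.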
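To make the ambiguity concrete, here is a rough sketch that lists every possible split instead of only the greedy one. It is not part of the hunspell solution above: `all_splits` and `toy_dict2` are hypothetical names, and the dictionary is a hand-picked assumption that contains both "guidebook" and its parts.

```r
# Hypothetical toy dictionary containing both "guidebook" and its parts.
toy_dict2 <- c("family", "guide", "book", "guidebook")

# Return every way to split `x` into words from `dict`, as space-joined strings.
all_splits <- function(x, dict) {
  n <- nchar(x)
  if (n == 0) return("")                 # exactly one way to split the empty string
  out <- character(0)
  for (k in seq_len(n)) {
    first <- substr(x, 1, k)             # candidate first word
    if (first %in% dict) {
      rest <- all_splits(substr(x, k + 1, n), dict)
      # Only keep `first` if the remainder can be split too.
      if (length(rest) > 0) out <- c(out, trimws(paste(first, rest)))
    }
  }
  out
}

all_splits("familyguidebook", toy_dict2)
# [1] "family guide book" "family guidebook"
```

With more than one result you still need some rule to choose between them (fewest words, word frequency, and so on), and the recursion can blow up on long strings, so treat this as a way to spot ambiguous inputs rather than a full solution.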

Answer 1
5 accepted 2018-07-30 23:07:01