Keep exact words in an R corpus

Question

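From a posted answer by @MrFlick: Keep document ID with R corpus

I am trying to slightly modify what is a great example.

Question: how do I modify the content_transformer function so that it keeps only exact words? You can see in the inspect output that wonderful is counted as wonder and rationale is counted as ratio. I do not have a strong understanding of gregexpr and regmatches.

Create the data frame: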

dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale")
  , stringsAsFactors = FALSE
)

Now, in order to read special attributes from a data.frame, we will use the readTabular function to create our own custom data.frame reader:

library(tm)
myReader <- readTabular(mapping = list(content = "text", id = "id"))

This specifies which columns of the data.frame to use for the content and for the ID. Now we read it in with DataframeSource, but using our custom reader.

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

Now, if we want to keep only a certain set of words, we can create our own content_transformer function. One way to do this is

  keepOnlyWords <- content_transformer(function(x, words) {
        regmatches(x, 
            gregexpr(paste0("\\b(",  paste(words, collapse = "|"), "\\b)"), x)
        , invert = T) <- " "
        x
    })

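This replaces everything that is not in the word list with a space. Note that you will probably want to run stripWhitespace afterwards. So our transformations would look like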

keep <- c("wonder", "then", "that", "the")

tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)

Inspecting the document-term matrix (dtm):

> inspect(dtm)
<<DocumentTermMatrix (documents: 4, terms: 4)>>
Non-/sparse entries: 7/9
Sparsity           : 56%
Maximal term length: 6
Weighting          : term frequency (tf)

    Terms
Docs ratio that the wonder
  10     0    1   1      1
  11     0    1   0      0
  12     0    0   1      0
  13     1    0   1      0

Answer 1

Switching to tidytext grammar, your current transformation would be

library(tidyverse)
library(tidytext)
library(stringr)

dd %>% unnest_tokens(word, text) %>% 
    mutate(word = str_replace_all(word, setNames(keep, paste0('.*', keep, '.*')))) %>% 
    inner_join(data_frame(word = keep))

##   id   word
## 1 10 wonder
## 2 10    the
## 3 10   that
## 4 11   that
## 5 12    the
## 6 12    the
## 7 13    the

Keeping only exact matches is easier, because you can use a join (which compares with ==) instead of a regex:

dd %>% unnest_tokens(word, text) %>% 
    inner_join(data_frame(word = keep))

##   id word
## 1 10 then
## 2 10 that
## 3 11 that
## 4 13  the
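
The same idea can be sketched in base R without tidytext: split each text into tokens and keep the ones that are exactly equal to an entry in keep. This is only an illustration. The `strsplit()` call on runs of non-letters is a rough stand-in for `unnest_tokens()`, not its actual tokenizer, and the names `tokens` and `exact` are made up for this sketch.

```r
dd <- data.frame(
  id = 10:13,
  text = c("No wonderful, then, that ever",
           "So that in many cases such a ",
           "But there were still other and",
           "Not even at the rationale"),
  stringsAsFactors = FALSE
)
keep <- c("wonder", "then", "that", "the")

# Lower-case the text and split on runs of non-letters
# (an assumption standing in for unnest_tokens()).
tokens <- strsplit(tolower(dd$text), "[^a-z]+")

# Keep only tokens that are exactly equal to one of the words in keep.
exact <- lapply(tokens, function(w) w[w %in% keep])
names(exact) <- dd$id

exact[["10"]]  # "then" "that"
exact[["13"]]  # "the"
```

Because `%in%` compares whole tokens, "wonderful" and "rationale" are dropped rather than shortened.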

To take it back to a document-term matrix,

library(tm)

dd %>% mutate(id = factor(id)) %>%    # to keep empty rows of DTM
    unnest_tokens(word, text) %>% 
    inner_join(data_frame(word = keep)) %>% 
    mutate(i = 1) %>% 
    cast_dtm(id, word, i) %>% 
    inspect()

## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs then that the
##   10    1    1   0
##   11    0    1   0
##   12    0    0   0
##   13    0    0   1

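Currently, your pattern puts a word boundary before the whole group but after only the last word, so every other word can still match the start of a longer word. To require a boundary both before and after each word, change the collapse argument to include the boundaries: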

tm <- VCorpus(DataframeSource(dd), readerControl = list(reader = myReader))

keepOnlyWords<-content_transformer(function(x,words) {
        regmatches(x, 
            gregexpr(paste0("(\\b",  paste(words, collapse = "\\b|\\b"), "\\b)"), x)
        , invert = T) <- " "
        x
    })

tm <- tm_map(tm, content_transformer(tolower))
tm <- tm_map(tm, keepOnlyWords, keep)
tm <- tm_map(tm, stripWhitespace)

inspect(DocumentTermMatrix(tm))

## <<DocumentTermMatrix (documents: 4, terms: 3)>>
## Non-/sparse entries: 4/8
## Sparsity           : 67%
## Maximal term length: 4
## Weighting          : term frequency (tf)
## 
##     Terms
## Docs that the then
##   10    1   0    1
##   11    1   0    0
##   12    0   0    0
##   13    0   1    0
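
To see the effect of the boundaries without building a corpus, the two patterns can be compared on plain character vectors. This is a minimal base-R sketch, assuming the already lower-cased texts from the question. `keep_only`, `loose` and `strict` are hypothetical names introduced here, and the whitespace cleanup only approximates what stripWhitespace does.

```r
keep <- c("wonder", "then", "that", "the")
docs <- c("no wonderful, then, that ever",
          "so that in many cases such a ",
          "but there were still other and",
          "not even at the rationale")

# Hypothetical helper: replace everything that does not match `pattern`
# with a space, then squeeze and trim the whitespace.
keep_only <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x), invert = TRUE) <- " "
  trimws(gsub(" +", " ", x))
}

# Pattern from the question: the trailing \b sits inside the group,
# so it applies only to the last alternative ("the").
loose <- paste0("\\b(", paste(keep, collapse = "|"), "\\b)")
# Pattern from this answer: every alternative has a boundary on both sides.
strict <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

cat(loose, strict, sep = "\n")
# \b(wonder|then|that|the\b)
# (\bwonder\b|\bthen\b|\bthat\b|\bthe\b)

keep_only(docs, loose)   # "wonder then that" "that" "" "the"
keep_only(docs, strict)  # "then that" "that" "" "the"
```

With the loose pattern, "wonderful" is cut down to "wonder"; with the strict pattern it is dropped entirely.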

Answer 2

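I get the same result as @alistaire with tm by changing the following line in the keepOnlyWords content transformer first defined by @BEMR:

gregexpr(paste0("\\b(",  paste(words, collapse = "|"), ")\\b"), x)

There is a misplaced ")" in the gregexpr first specified by @BEMR: it should be ")\\b" and not "\\b)".

I think the gregexpr above is equivalent to the one specified by @alistaire: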

gregexpr(paste0("(\\b",  paste(words, collapse = "\\b|\\b"), "\\b)"), x)
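
The claimed equivalence can be checked on the example texts with a few lines of base R. This is a sketch only: it compares the matches for the four lower-cased strings from the question and does not prove equivalence in general. The variable names `grouped` and `repeated` are invented for this illustration.

```r
keep <- c("wonder", "then", "that", "the")
docs <- c("no wonderful, then, that ever",
          "so that in many cases such a ",
          "but there were still other and",
          "not even at the rationale")

# Boundary written once around the whole group (the line shown in this answer).
grouped <- paste0("\\b(", paste(keep, collapse = "|"), ")\\b")
# Boundary repeated around every alternative (the version from @alistaire).
repeated <- paste0("(\\b", paste(keep, collapse = "\\b|\\b"), "\\b)")

# Extract the matched substrings of every document for each pattern.
m_grouped <- regmatches(docs, gregexpr(grouped, docs))
m_repeated <- regmatches(docs, gregexpr(repeated, docs))

identical(m_grouped, m_repeated)  # TRUE for these four strings
```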


Answer 1: score 2, accepted, 2016-12-02 15:56:45

Answer 2: score 1, 2017-09-18 04:33:54
