Finding the documents removed from a corpus in R

Question

I want to preprocess text before analysis.

mydat

   Production of banners 1,2x2, Cutting
Production of a plate with the size 2330 * 600mm
Delivery
















Placement of advertising information on posters 0.85 * 0.65 at Ordzhonikidze Street (TSUM) -Gerzen, side A2 April 2014
Manufacturing of a banner 3,7х2,7
Placement of advertising information on the prismatron 3 * 4 at 60, Ordzhonikidze, Aldjonikidze Street, A (01.12.2011-14.12.2011)
Placement of advertising information on the multipanel 3 * 12 at Malygina-M.Torez street, side A, (01.12.2011-14.12.2011)
Designer services
41526326

12
Mounting and rolling of the RIM on the prismatron 3 * 6

Code

library(tm)  # stemDocument() also needs the SnowballC package installed

mydat <- read.csv("C:/kr_csv.csv", sep = ";", dec = ",")

# Build a corpus with one document per row of mydat$descr
tw.corpus <- Corpus(VectorSource(mydat$descr))
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus, stemDocument)


# Delete empty documents

doc.m <- DocumentTermMatrix(tw.corpus)

rowTotals <- apply(doc.m, 1, sum)  # number of words left in each document
doc.m.new <- doc.m[rowTotals > 0, ]

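1. How can I find out which observations were removed during preprocessing (for example, that the first and second texts were removed)?
2. How can these observations be removed from the original dataset (mydat)?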

Answer 1

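After preprocessing and stemming, you count how many words are left in each document. Naturally, a "document" with no words in it has a count of zero. Documents that contained only numbers and punctuation are also empty, because you removed those characters.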

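Your data contains many "documents" that are just empty lines. The corpus has 28 "documents" in total, but more than half of them are empty lines (that is, they contain zero words).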
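To see why such rows end up with zero words, here is a minimal base-R sketch that only approximates the cleaning steps with gsub(); it is not tm's own implementation, and the five sample strings are illustrative picks in the style of the data, not the full dataset.

```r
# Base-R approximation of the cleaning steps (an assumption, not tm itself).
docs <- c("Designer services", "41526326", "", "12", "3 * 4 (01.12.2011)")

cleaned <- gsub("[[:punct:]]+", "", docs)     # drop punctuation
cleaned <- gsub("[[:digit:]]+", "", cleaned)  # drop digits
cleaned <- trimws(tolower(cleaned))           # lower-case, trim whitespace

# Count the non-empty whitespace-separated tokens left in each document.
word_counts <- vapply(
  strsplit(cleaned, "[[:space:]]+"),
  function(w) sum(nzchar(w)),
  integer(1)
)

# Positions of the documents that became empty.
empty_docs <- which(word_counts == 0)
print(word_counts)
print(empty_docs)
```

Only the first string keeps any words; the number-only, punctuation-only, and blank entries all collapse to empty strings.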

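You compute rowTotals, the number of words in each document. If you check which entries of rowTotals equal zero, you get the numbers of the documents that are subsequently dropped from doc.m: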

rowTotals
# 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
# 3  5  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10  2  8  8  2  0  0  0  7

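You can see that documents 4, 5, 6, and so on contain zero words and are therefore absent from doc.m.new. You can use which() to get these numbers automatically: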

which( rowTotals == 0)
# [1] 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 25 26 27
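For the second part of the question, the same logical filter can be applied to the original data frame. This is a sketch under one assumption: the rows of doc.m are in the same order as the rows of mydat, which should hold when the corpus is built directly from mydat$descr. The mydat below is a hypothetical stand-in with placeholder text, and mydat.new is a name chosen here for illustration.

```r
# Row totals as printed above, one entry per row of mydat.
rowTotals <- c(3, 5, 1, rep(0, 16), 10, 2, 8, 8, 2, 0, 0, 0, 7)

# Hypothetical stand-in for mydat; the real one comes from read.csv().
mydat <- data.frame(
  descr = paste("text", seq_along(rowTotals)),
  stringsAsFactors = FALSE
)

# Row numbers of the observations that were dropped from doc.m.
removed <- which(rowTotals == 0)

# Keep only the rows whose document still has at least one word,
# using the same condition that produced doc.m.new.
mydat.new <- mydat[rowTotals > 0, , drop = FALSE]

print(removed)
print(rownames(mydat.new))
```

Because the same condition filters both objects, row i of mydat.new corresponds to row i of doc.m.new. The original row numbers stay available through rownames(mydat.new).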

Answer 1: accepted, 2018-03-18 13:35:36