
stemDocument R text mining

My data is a txt file and looks as follows:
words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2

The document numbers are sorted in ascending order. For each document, I want all the words that belong to it. Right now the words sit in a column, but I want them in a TextDocument (from the tm package, because some functions in that package require it). I did this as follows:

 data <- read.table("poging.txt", header = TRUE)
 data

 doc <- c()
 #I paste all the words from a document together:
 doc[1] <- paste(data[1:4,1], collapse = ' ')
 doc[2] <- paste(data[5:7,1], collapse = ' ')

 #Make a data.frame of it
 doc_df <- data.frame(docs = doc, row.names = 1:2)

 #Install package
 install.packages("tm")
 library(tm)

 #Make a Dataframesource of it so that each row is seen as a document
 ds <- DataframeSource(doc_df)
 inspect(VCorpus(ds))

 #Now I want to stem for example document number 1
 stemDocument(ds[[1]])

But using ds[[1]] as the argument doesn't work: it can't find document number 1. Can someone help me?

In the examples in the tm package they use the crude data set. I want my data to be in the same format as crude.

Silke

stemDocument() is meant to be used with a TextDocument, not a DataSource. Use the DataSource to create a corpus, then extract the documents from there.

ds <- DataframeSource(doc_df)
corpus <- VCorpus(ds)
stemDocument(corpus[[1]])

Note that stemDocument() returns a new document and does not update the corpus permanently. So if you wish to do anything with the output, be sure to save it somewhere.
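As a rough illustration of keeping the stemmed output, here is a minimal sketch. It rebuilds the question's data inline as a toy data frame and uses VectorSource() instead of DataframeSource(), which is my substitution and not part of the original answer (recent tm versions expect DataframeSource() input to have doc_id and text columns). The names docs, stemmed_first and corpus_stemmed are made up for the example, and stemDocument() also needs the SnowballC package to be installed.

```r
library(tm)  # stemDocument() also relies on the SnowballC package being installed

# Toy data mirroring the question's file (assumption: same two columns)
data <- data.frame(
  words = c("overwiew", "client", "store", "marge",
            "price", "stock", "economics"),
  number_doc = c(1, 1, 1, 1, 2, 2, 2),
  stringsAsFactors = FALSE
)

# One string per document: paste together the words that share a document number
docs <- tapply(data$words, data$number_doc, paste, collapse = " ")

# Build a corpus from the plain character vector (swapped in for DataframeSource)
corpus <- VCorpus(VectorSource(as.character(docs)))

# stemDocument() returns a new document, so keep the result in a variable
stemmed_first <- stemDocument(corpus[[1]])
content(stemmed_first)

# Optionally write the stemmed document back into the corpus
corpus[[1]] <- stemmed_first

# Or stem every document at once; tm_map() also returns a new corpus
corpus_stemmed <- tm_map(corpus, stemDocument)
```

Grouping with tapply() avoids hard-coding row ranges such as data[1:4, 1], so the same lines should work when the file contains more documents.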
