Including short tokens in a tm DocumentTermMatrix

EDIT: This turned out to be caused by conflicting objects in my workspace, which produced the unexpected behavior.

I am trying to create a DocumentTermMatrix from a document using the following code. The document contains many 1- and 2-character tokens. However, even with the minimum word length set to 1 character, the resulting matrix contains 699 documents and 0 terms.

library(tm)
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
data <- data[-1]

training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
corpus <- Corpus(VectorSource(training_data))

matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))

Can anyone shed some light on why no terms are created even though the data contains many 1- and 2-character tokens? Here is one sample data entry:

" 4  8  8  5  4 5 10  4  1 4"

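As a quick sanity check on the sample entry, it can be split on whitespace in base R to confirm that every token is 1 or 2 characters long. This is only a sketch: plain whitespace splitting is an assumption here and is not the tokenizer tm uses, and the variable names `entry` and `tokens` are illustrative. It does show why the lower bound matters, since tm's default `wordLengths` is `c(3, Inf)`, which would discard every one of these tokens.

```r
# Sample entry copied from the question.
entry <- " 4  8  8  5  4 5 10  4  1 4"

# Split on runs of whitespace (an assumption; tm's own tokenizer may differ).
tokens <- strsplit(trimws(entry), "\\s+")[[1]]

# Count how many tokens there are of each length.
print(table(nchar(tokens)))

# Tokens shorter than 3 characters, i.e. those the default wordLengths would drop.
print(tokens[nchar(tokens) < 3])
```
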
I ran exactly the code you posted with the latest versions of R and tm on a Windows 7 machine and got the results you were looking for (see below). I'd try clearing your workspace, exiting R, and/or rebooting.

> library(tm)
> data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
> data <- data[-1]
> 
> training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
> corpus <- Corpus(VectorSource(training_data))
> 
> matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))
> matrix
A document-term matrix (699 documents, 11 terms)

Non-/sparse entries: 2899/4790
Sparsity           : 62%
Maximal term length: 2 
Weighting          : term frequency (tf)
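
The workspace conflict mentioned in the question's edit can be checked for with base R before resorting to a restart. The sketch below is hypothetical: the thread does not say which object was conflicting, so a user-defined `paste` stands in for a stale object left over from an earlier session.

```r
# Hypothetical stale object: a user-defined `paste` that shadows base::paste.
paste <- function(...) ""

# find() lists every place on the search path where the name is defined;
# more than one entry means the workspace copy masks the package version.
print(find("paste"))

# Clear all objects from the current workspace, then check again.
rm(list = ls())
print(find("paste"))

# A namespace-qualified call bypasses anything defined in the workspace.
print(base::paste("4", "8"))
```

If `find()` reports `.GlobalEnv` for a function that the code relies on, removing that object (or calling the function with its namespace prefix, such as `tm::DocumentTermMatrix`) is usually enough, without rebooting.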
