Including short tokens in a tm DocumentTermMatrix
EDIT: This was an issue with objects in the workspace conflicting and causing unexpected behavior.
I am trying to create a DocumentTermMatrix from a document using the following code. The document contains many 1- and 2-character tokens. However, even when the minimum word length is set to 1 character, the resulting matrix contains 699 documents and 0 terms.
library(tm)
data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
data <- data[-1]
training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
corpus <- Corpus(VectorSource(training_data))
matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))
Can anyone shed light on why no terms are created, despite the data containing many 1- and 2-character tokens? Here is one sample data entry:
" 4 8 8 5 4 5 10 4 1 4"
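For reference, the default behavior can be reproduced on a tiny in-memory corpus (a minimal sketch; the two sample rows below are made up). `DocumentTermMatrix` defaults to `wordLengths = c(3, Inf)`, so every 1- and 2-character token is silently dropped unless the lower bound is overridden:

```r
library(tm)

# Two made-up rows of space-separated short tokens
docs <- c(" 4 8 8 5 4 5 10 4 1 4", "5 1 1 1 2 1 3 1 1")
corpus <- Corpus(VectorSource(docs))

# Default control uses wordLengths = c(3, Inf): all tokens here are too short
dtm_default <- DocumentTermMatrix(corpus)

# Lowering the minimum word length to 1 keeps the short tokens
dtm_short <- DocumentTermMatrix(corpus,
                                control = list(wordLengths = c(1, Inf)))
```

With the default control, `dtm_default` ends up with zero terms on data like this, while `dtm_short` retains the single- and double-digit tokens, which is why `wordLengths = c(1, Inf)` in the question should work.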
I ran exactly the code you posted, using the latest versions of R and tm on a Windows 7 machine, and it produced the results you were looking for (see below). I'd try clearing your workspace, exiting R, and/or rebooting.
> library(tm)
> data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header=FALSE)
> data <- data[-1]
>
> training_data <- as.vector(apply(as.matrix(data, mode="character"),1,paste,collapse=" "))
> corpus <- Corpus(VectorSource(training_data))
>
> matrix <- DocumentTermMatrix(corpus,control=list(wordLengths=c(1,Inf)))
> matrix
A document-term matrix (699 documents, 11 terms)
Non-/sparse entries: 2899/4790
Sparsity : 62%
Maximal term length: 2
Weighting : term frequency (tf)
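Since the same code works in a fresh session, the likely culprit is a stale object in the workspace (for example, the script assigns to names like `matrix` and `data`, which can shadow base functions or carry values left over from an earlier session). A quick sanity check, as a minimal sketch, is to clear everything and re-run from the top:

```r
# Remove every object from the global environment, so nothing left over
# from a previous session can shadow functions or feed in stale data
rm(list = ls())
# Then restart R and re-run the script from the beginning
```

Restarting R (rather than only calling `rm`) also unloads any packages or options that might have been changed interactively.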