
R Text Mining Issue

I am totally new to programming and am now doing my ResMA, where I have started learning R.

I have to do something very simple, but I keep failing at some point. I just have to count the graphemes (the letters) in one txt file, nothing else. I first create a corpus with tm and clean it, but when I run the frequency analysis of each grapheme, the text has not actually been cleaned of punctuation, strange symbols, etc.

The code I am using is this:

library("tm")
library("SnowballC")
library("wordcloud")
library("RColorBrewer")
filePath <- choose.files()
text <- readLines(filePath)
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)
print (docs)

x=strsplit(text, "")

newlist = unlist(x,recursive=F)
freqtab = table(newlist)

print (freqtab)

OK, so it is obvious that docs here is a totally different object from the x further down, but when I try to do things the other way round, it still does not work.

I just need to do this: I am going to school! ---> i am going to school ----> i- 2 a- 1 m- 1 ....

I don't understand where my problem is coming from. I would appreciate your help!

The problem is that none of your operations modify text; they all operate on docs.
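To illustrate the point in isolation: R functions return a modified copy instead of changing their input, so the cleaned result has to be assigned to a name and that name has to be the one you count from. A minimal base-R sketch (the variable name cleaned is hypothetical, not from the post):

```r
text <- "I am going to school!"

# gsub() and tolower() return new strings; text itself is left untouched.
cleaned <- tolower(gsub("[[:punct:]]", "", text))

print(text)     # the original string, punctuation and capitals intact
print(cleaned)  # the cleaned copy
```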

Running your code using the simple example in your post as text,

text <- "I am going to school!"
docs <- Corpus(VectorSource(text))
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
docs <- tm_map(docs, stemDocument)

and then printing the content of docs shows that all your modifications were applied:

print(unlist(docs)[1])
    content.content 
"i am go to school" 

although note that because of the stemmer, "going" is transformed to "go".

Then you can count the characters as in your original code,

x=strsplit(as.character(unlist(docs)[1]), "")
freqtab = table(x[[1]])
print(freqtab)

  a c g h i l m o s t 
4 1 1 1 1 1 1 1 4 1 1
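Note that the first column of that table is the space character, and that the stemmer has changed the letter counts. If only the letter frequencies are needed, one possible alternative is to skip tm altogether. The sketch below is an assumption about what the question wants, not part of the original answer: count_letters is a hypothetical helper, it joins all lines so that a multi-line file is counted as one text, and it keeps only characters matched by [[:alpha:]] (which characters count as letters depends on your locale).

```r
# Hypothetical helper (not from the original post): letter frequencies only.
count_letters <- function(lines) {
  # Join all lines so a file read with readLines() is treated as one text.
  joined <- paste(lines, collapse = " ")
  # Lower-case, then drop everything that is not a letter
  # (punctuation, digits, whitespace).
  letters_only <- gsub("[^[:alpha:]]", "", tolower(joined))
  # Split into single characters and tabulate them.
  table(strsplit(letters_only, "")[[1]])
}

freqtab <- count_letters("I am going to school!")
print(freqtab)
```

For a real file, the same helper could be called as count_letters(readLines(filePath)). No stemming is applied here, so "going" keeps all of its letters.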

Hope it helps.
