Convert TDM CSV file into Corpus Format in Text Mining

I am using the tm package for text mining in R. I performed the following steps:

Import the data into R and create a text corpus

dataorg <- read.csv("Report_2014.csv")
corpus <- Corpus(VectorSource(data$Resolution))

Clean the data

mystopwords <- c("through","might","much","had","got","with","these")

cleanset <- tm_map(corpus, removeWords, mystopwords)
cleanset <- tm_map(cleanset, tolower)
cleanset <- tm_map(cleanset, removePunctuation)
cleanset <- tm_map(cleanset, removeNumbers)

Create the term-document matrix

tdm <- TermDocumentMatrix(cleanset)

At this point I export the TDM data to CSV in order to perform some manual cleansing of the terms

write.csv(inspect(tdm), file="tdmfile.csv")

Now the problem is that I want to bring the cleaned TDM CSV file back into R and perform further text analysis such as clustering and frequency analysis. But I am not able to convert the CSV file back into a corpus format accepted by the tm package's functions, so I cannot proceed with my text analysis.

It would be really helpful if somebody could help me convert the cleaned CSV file into a corpus format that the text analysis functions of the tm package accept.

First read the CSV back into R

df<-read.csv("tdmfile.csv")

Then convert the vector (referenced by the column name) into a corpus

corpus<-Corpus(VectorSource(df$column))

If the above doesn't work, try converting the text to UTF-8 before building the corpus

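# iconv() expects a character vector, so convert the text column, not the whole data frame
df$column <- iconv(df$column, to = "UTF-8")  # "utf-8-mac" is a macOS-only encoding name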
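Note that tdmfile.csv holds a term-by-document count matrix rather than raw text, so another route is to rebuild the matrix object directly instead of going back through a corpus. Below is a minimal sketch of that idea, assuming the file has the terms in its first column and one column per document (the layout write.csv produces from as.matrix(tdm)). The toy counts and the temporary file are made-up stand-ins for the real, hand-cleaned file.

```r
library(tm)

# Toy stand-in for the edited file: terms in rows, documents in columns
# (assumed layout; the real input would be the hand-cleaned tdmfile.csv).
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 3,
                 dimnames = list(c("error", "restart", "server"),
                                 c("1", "2")))
csv_path <- tempfile(fileext = ".csv")
write.csv(counts, file = csv_path)

# Read it back, using the first column (the terms) as row names.
# check.names = FALSE keeps document ids such as "1" from turning into "X1".
m <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))

# Rebuild a TermDocumentMatrix with plain term-frequency weighting.
tdm2 <- as.TermDocumentMatrix(m, weighting = weightTf)

# tm functions accept the rebuilt object, e.g. terms occurring at least 3 times.
frequent <- findFreqTerms(tdm2, lowfreq = 3)
```

With the real file, only the read.csv() path changes. The rebuilt tdm2 can then be passed to functions such as findFreqTerms() or, via as.matrix(tdm2), to dist() and hclust() for clustering.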

You assign the data to dataorg, but I don't see it used anywhere else in your code. If you want to convert your CSV file into corpus format, just follow this link:
R text mining documents from CSV file (one row per doc)
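One common way to treat each CSV row as a document can be sketched as follows. This is not necessarily the method described at that link. It assumes a text column named Resolution, as in the question; the two sample rows are invented for illustration and stand in for the result of read.csv("Report_2014.csv").

```r
library(tm)

# Made-up rows standing in for Report_2014.csv; one row is one document.
dataorg <- data.frame(Resolution = c("Restarted the server",
                                     "Replaced the faulty disk"),
                      stringsAsFactors = FALSE)

# Each element of the character vector becomes one document in the corpus.
corpus <- VCorpus(VectorSource(dataorg$Resolution))

# content_transformer() wraps a base function so the documents keep their
# tm structure after the transformation.
corpus <- tm_map(corpus, content_transformer(tolower))

# Documents in rows, terms in columns.
dtm <- DocumentTermMatrix(corpus)
```

The same object name (dataorg) is used for reading and for building the corpus, which avoids the mismatch noted above.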
