
Create a Corpus from many HTML files in R

I would like to create a Corpus from a collection of downloaded HTML files, and then read them into R for future text mining.

Essentially, this is what I want to do:

  • Create a Corpus from multiple HTML files.

I tried to use DirSource:

library(tm)
a <- DirSource("C:/test")
b <- Corpus(DirSource(a), readerControl = list(language = "eng", reader = readPlain))

but it returns "invalid directory parameters".

  • Read in the HTML files from the Corpus all at once. Not sure how to do it.

  • Parse them, convert them to plain text, and remove the tags. Many people suggested using XML; however, I didn't find a way to process multiple files — every example I found handles a single file.
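Something like this is the kind of thing I mean for the last step (a crude base-R sketch with a hypothetical helper; a plain regex strip is only fine for simple pages — the XML package is the robust route):

```r
# Hypothetical helper: strip tags from one HTML file using base R only.
html_to_text <- function(path) {
  raw <- paste(readLines(path, warn = FALSE), collapse = " ")
  txt <- gsub("<[^>]+>", " ", raw)   # drop tags (naive: ignores scripts, comments, entities)
  trimws(gsub("\\s+", " ", txt))     # collapse runs of whitespace
}

# quick demo with a throwaway file
f <- tempfile(fileext = ".html")
writeLines("<html><body><p>Hello <b>world</b></p></body></html>", f)
html_to_text(f)   # "Hello world"

# applied to a whole folder it would be something like:
# texts <- lapply(list.files("C:/test", pattern = "\\.html?$", full.names = TRUE), html_to_text)
```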

Thanks very much.

This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO), and I've made a corpus out of them, then a document-term matrix, and then done a few trivial text-mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files 
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
code <- getURL("https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R", ssl.verifypeer = FALSE)
writeChar(code, con = "htmlToText.R")
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a <- tm_map(a, PlainTextDocument)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10))) 
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,] 
inspect(a.dtm2)

# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(as.matrix(a.dtm3)) # inspect() only prints; as.matrix() returns the data
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean") 
fit <- hclust(d, method = "ward.D") # "ward" was renamed "ward.D" in current R
plot(fit)

[plot: cluster dendrogram of the terms]

# just for fun... 
library(wordcloud)
library(RColorBrewer)

m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

[plot: word cloud of the most frequent terms]

This will correct the error.

b <- Corpus(a, ## pass 'a' directly instead of DirSource(a): 'a' is already a DirSource
            readerControl = list(language = "eng", reader = readPlain))

But I think that to read your HTML you need to use an XML reader. Something like:

r <- Corpus(DirSource('c:/test'), # forward slash: '\t' in a string is a tab in R
            readerControl = list(reader = readXML), spec)

But you need to supply the spec argument, which depends on your file structure. See, for example, readReut21578XML; it is a good example of an XML/HTML parser.
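If working out a readXML spec is more than you need, a simpler route (a sketch, not part of this answer; it assumes the XML package is installed and the folder path is yours to adjust) is to parse each file yourself and build the corpus from the extracted text:

```r
library(tm)
library(XML)

# parse each HTML file, pull out the visible text, and collect one string per file
files <- list.files("C:/test", pattern = "\\.html?$", full.names = TRUE)
texts <- vapply(files, function(f) {
  doc <- htmlParse(f)                               # lenient HTML parser
  txt <- xpathSApply(doc, "//body//text()", xmlValue)
  free(doc)                                         # release the C-level document
  paste(txt, collapse = " ")
}, character(1))

corpus <- Corpus(VectorSource(texts))
```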

To read all the HTML files into an R object you can use:

# Set variables
folder <- 'C:/test'
extension <- '\\.html?$' # regex: files ending in .htm or .html ('.htm' alone also matches e.g. 'chtml')

# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)

# Read all the files into a list
htmls <- lapply(X=files,
                FUN=function(file){
                 .con <- file(description=paste(folder, file, sep='/'))
                 .html <- readLines(.con)
                 close(.con)
                 names(.html)  <- file
                 .html
})

That will give you a list, where each element is the HTML content of one file.

I'll post later on parsing it; I'm in a hurry.
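Since that follow-up never arrived, one way to parse the list (a sketch assuming the XML package; `htmls` is the list built above) is:

```r
library(XML)

# each element of 'htmls' is a character vector of lines from one file
texts <- sapply(htmls, function(lines) {
  doc <- htmlParse(paste(lines, collapse = "\n"), asText = TRUE)
  txt <- xpathSApply(doc, "//text()", xmlValue)  # all text nodes
  free(doc)
  paste(txt, collapse = " ")
})
```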

I found the boilerpipeR package particularly useful for extracting just the "core" text of an HTML page.
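A minimal sketch of that approach (assuming boilerpipeR and its rJava dependency are installed; DefaultExtractor is its stock article extractor, and "page.html" is a placeholder path):

```r
library(boilerpipeR)

# feed the raw HTML as a single string; boilerpipe strips navigation/boilerplate
html <- paste(readLines("page.html", warn = FALSE), collapse = "\n")
core <- DefaultExtractor(html)   # the main article text
```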
