create a Corpus from many html files in R

Question

I would like to create a Corpus for the collection of downloaded HTML files, and then read them in R for future text mining.

Essentially, this is what I want to do:

Create a Corpus from multiple html files.

I tried to use DirSource:

library(tm)
a<- DirSource("C:/test")
b<-Corpus(DirSource(a), readerControl=list(language="eng", reader=readPlain))

but it returns "invalid directory parameters"

Read in html files from the Corpus all at once. Not sure how to do it.
Parse them, convert them to plain text, remove tags. Many people suggested using XML, however, I didn't find a way to process multiple files. They are all for one single file.

Thanks very much.

Answer 1

This should do it. Here I've got a folder on my computer of HTML files (a random sample from SO) and I've made a corpus out of them, then a document term matrix and then done a few trivial text mining tasks.

# get data
setwd("C:/Downloads/html") # this folder has your HTML files 
html <- list.files(pattern="\\.(htm|html)$") # get just .htm and .html files

# load packages
library(tm)
library(RCurl)
library(XML)
# get some code from github to convert HTML to text
writeChar(con="htmlToText.R", (getURL(ssl.verifypeer = FALSE, "https://raw.github.com/tonybreyal/Blog-Reference-Functions/master/R/htmlToText/htmlToText.R")))
source("htmlToText.R")
# convert HTML to text
html2txt <- lapply(html, htmlToText)
# clean out non-ASCII characters
html2txtclean <- sapply(html2txt, function(x) iconv(x, "latin1", "ASCII", sub=""))

# make corpus for text mining
corpus <- Corpus(VectorSource(html2txtclean))

# process text...
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(a, PlainTextDocument)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10))) 
newstopwords <- findFreqTerms(a.dtm1, lowfreq=10) # get most frequent words
# remove most frequent words for this corpus
a.dtm2 <- a.dtm1[!(a.dtm1$dimnames$Terms) %in% newstopwords,] 
inspect(a.dtm2)

# carry on with typical things that can now be done, ie. cluster analysis
a.dtm3 <- removeSparseTerms(a.dtm2, sparse=0.7)
a.dtm.df <- as.data.frame(inspect(a.dtm3))
a.dtm.df.scale <- scale(a.dtm.df)
d <- dist(a.dtm.df.scale, method = "euclidean") 
fit <- hclust(d, method="ward")
plot(fit)

在此输入图像描述

# just for fun... 
library(wordcloud)
library(RColorBrewer)

m = as.matrix(t(a.dtm1))
# get word counts in decreasing order
word_freqs = sort(colSums(m), decreasing=TRUE) 
# create a data frame with words and their frequencies
dm = data.frame(word=names(word_freqs), freq=word_freqs)
# plot wordcloud
wordcloud(dm$word, dm$freq, random.order=FALSE, colors=brewer.pal(8, "Dark2"))

在此输入图像描述

Answer 2

This will correct the error.

 b<-Corpus(a, ## I change DireSource(a) by a
          readerControl=list(language="eng", reader=readPlain))

But I think to read your Html you need to use xml reader. Something like :

r <- Corpus(DirSource('c:\test'),
             readerControl = list(reader = readXML),spec)

But you need to supply the spec argument, which depends with your file structure. see for example readReut21578XML . It is a good example of xml/html parser.

Answer 3

To read all the html files into an R object you can use

# Set variables
folder <- 'C:/test'
extension <- '.htm'

# Get the names of *.html files in the folder
files <- list.files(path=folder, pattern=extension)

# Read all the files into a list
htmls <- lapply(X=files,
                FUN=function(file){
                 .con <- file(description=paste(folder, file, sep='/'))
                 .html <- readLines(.con)
                 close(.con)
                 names(.html)  <- file
                 .html
})

That will give you a list, and each element is the HTML content of each file.

I'll post later on parsing it, I'm in a hurry.

Answer 4

我发现包samppipeR对于仅提取html页面的“核心”文本特别有用。

create a Corpus from many html files in R

Question

4 answers

solution1
13 2013-02-22 07:38:22

solution2
2 2013-02-22 04:12:57

solution3
0 2013-02-22 04:16:55

solution4
0 2015-03-16 11:31:51

create a Corpus from many html files in R

Question

4 answers

solution1 13 2013-02-22 07:38:22

solution2 2 2013-02-22 04:12:57

solution3 0 2013-02-22 04:16:55

solution4 0 2015-03-16 11:31:51

solution1
13 2013-02-22 07:38:22

solution2
2 2013-02-22 04:12:57

solution3
0 2013-02-22 04:16:55

solution4
0 2015-03-16 11:31:51