Collating IRC archives into a corpus for text mining
For the scraping part, here is some code to get you started.
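library(XML)
# Parse the index page of the channel archive
rootUri <- "http://donttreadonme.co.uk"
doc <- htmlParse(paste0(rootUri, "/rubinius/index.html"))
# Collect every link on the page and keep only the 2014 logs
links <- xpathSApply(doc, "//a/@href")
links <- grep("rubinius/2014", links, value = TRUE)
# Remove ".." from the relative links so they can be appended to rootUri
links <- gsub("..", "", links, fixed = TRUE)
# Fetch the first five logs; the first table on each page holds the messages
messages <- lapply(links[1:5], function(l) {
  doc <- htmlParse(paste0(rootUri, l))
  readHTMLTable(doc, which = 1, header = FALSE)
})
# Stack the per-page tables into one data frame and preview it
messages <- do.call(rbind, messages)
head(messages)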
## V1 V2
## href.1 00:33:57 travis-ci
## href.2 05:04:23 travis-ci
## href.3 05:27:44 travis-ci
## href.4 10:00:59 yorickpeterse
## href.5 13:23:36 yorickpeterse
## href.6 13:23:53 yorickpeterse
## V3
## href.1 [travis-ci] rubinius/rubinius/master (fcc5b8c - Brian Shirai): The build passed.
## href.2 [travis-ci] rubinius/rubinius/master (901a6bc - Brian Shirai): The build was broken.
## href.3 [travis-ci] rubinius/rubinius/master (5cffe7b - Brian Shirai): The build was fixed.
## href.4 morning
## href.5 oh RubyGems, why do you need the ext builder during runtime?
## href.6 this better not be because I forgot --rubygems ignore
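From there, a possible next step toward a corpus is to name the columns, drop automated messages, and collapse the remaining text into one document per nick. The following is only a sketch in base R: the column names `time`, `nick`, and `text` are my own choice, the decision to exclude `travis-ci` is an assumption about what you want in the corpus, and the sample rows are copied from the output above so the snippet runs on its own.

```r
# Sample rows copied from the output above; in practice use the scraped `messages`.
messages <- data.frame(
  V1 = c("00:33:57", "10:00:59", "13:23:36", "13:23:53"),
  V2 = c("travis-ci", "yorickpeterse", "yorickpeterse", "yorickpeterse"),
  V3 = c(
    "[travis-ci] rubinius/rubinius/master (fcc5b8c - Brian Shirai): The build passed.",
    "morning",
    "oh RubyGems, why do you need the ext builder during runtime?",
    "this better not be because I forgot --rubygems ignore"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical column names, chosen for readability
names(messages) <- c("time", "nick", "text")

# Assumption: build notifications from the travis-ci bot are noise for text mining
human <- messages[messages$nick != "travis-ci", ]

# Collapse each nick's messages into a single newline-separated document
docs <- vapply(split(human$text, human$nick), paste, character(1), collapse = "\n")
```

`docs` is a named character vector, so if you use the tm package it could be passed on with something like `VCorpus(VectorSource(docs))`. Grouping by day instead of by nick would work the same way once a date column is attached to each page's table.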