For HTML content retrieved via R, I wonder what (other) options I have with respect to either
- file compression (maximum compression rate / minimum file size; the time it takes to compress is of secondary importance) when saving the content to disk, or
- storing the content most efficiently (by whatever means, OS filesystem or DBMS).
My current finding is that gzfile
offers the best compression rate in R. Can I do better? For example, I tried stripping unnecessary whitespace from the HTML code before saving, but it seems that gzfile
already takes care of that, as I don't end up with smaller file sizes in comparison.
Extended curiosity question:
How do search engines handle this problem? Or are they throwing away the code as soon as it has been indexed and thus something like this is not relevant for them?
Getting example HTML code:
url_current <- "http://cran.at.r-project.org/web/packages/available_packages_by_name.html"
html <- readLines(url(url_current))
Saving to disk:
path_txt <- file.path(tempdir(), "test.txt")
path_gz <- gsub("\\.txt$", ".gz", path_txt)
path_rdata <- gsub("\\.txt$", ".rdata", path_txt)
path_rdata_2 <- gsub("\\.txt$", "_raw.rdata", path_txt)
write(html, file=path_txt)
write(html, file=gzfile(path_gz, "w"))
save(html, file=path_rdata)
html_raw <- charToRaw(paste(html, collapse="\n"))
save(html_raw, file=path_rdata_2)
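As a sketch of one alternative (assuming the objects defined above are in the workspace): save() itself accepts other compressors besides the default gzip, and xz at a high compression level typically yields smaller .rdata files, at the cost of compression time.

```r
## Sketch: save() also supports bzip2 and xz compression.
## compression_level = 9 trades compression time for a smaller file.
path_rdata_xz <- gsub("\\.txt$", "_xz.rdata", path_txt)
save(html, file = path_rdata_xz, compress = "xz", compression_level = 9)
file.info(path_rdata_xz)$size
```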
Trying to remove unnecessary whitespace:
html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
path_gz_2 <- gsub("\\.txt$", "_2.gz", path_txt)
write(html_2, gzfile(path_gz_2, "w"))
html_2 <- gsub("\\n", "", html_2)
path_gz_3 <- gsub("\\.txt$", "_3.gz", path_txt)
write(html_2, gzfile(path_gz_3, "w"))
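Analogously, base R offers bzfile() and xzfile() connections as drop-in replacements for gzfile(); xz usually compresses better than gzip, though more slowly. A sketch (not benchmarked on this particular file, and reusing path_txt and html from above):

```r
## Sketch: writing the same content through bzip2 and xz connections.
path_bz <- gsub("\\.txt$", ".bz2", path_txt)
path_xz <- gsub("\\.txt$", ".xz", path_txt)
write(html, file = bzfile(path_bz, "w"))
write(html, file = xzfile(path_xz, "w"))  # default compression level is 9
file.info(c(path_bz, path_xz))$size
```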
Resulting file sizes:
files <- list.files(dirname(path_txt), full.names=TRUE)
fsizes <- file.info(files)$size
names(fsizes) <- sapply(files, basename)
> fsizes
       test.gz     test.rdata       test.txt      test_2.gz      test_3.gz
        164529         183818         849647         164529         164529
test_raw.rdata
        164608
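To compare compressors without going through file connections at all, memCompress() can be applied directly to a raw vector; a sketch for a quick size comparison (assuming html from above):

```r
## Sketch: in-memory compression of the raw HTML to compare compressed sizes.
html_raw <- charToRaw(paste(html, collapse = "\n"))
sapply(c("gzip", "bzip2", "xz"),
       function(type) length(memCompress(html_raw, type = type)))
```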
Checking validity of processed HTML code:
require("XML")
html_parsed <- htmlParse(html)
> xpathSApply(html_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML
html_2_parsed <- htmlParse(readLines(gzfile(path_gz_2)))
> xpathSApply(html_2_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML
html_3_parsed <- htmlParse(readLines(gzfile(path_gz_3)))
> xpathSApply(html_3_parsed, "//a[. = 'devtools']", xmlAttrs)
href
"../../web/packages/devtools/index.html"
## >> Valid HTML
Note that
html_2 <- gsub(">\\s*<", "", html)
strips away the > and < characters themselves. Instead, use capture groups to keep them:
html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)