
File compression and storage of HTML content

For HTML content retrieved via R, I wonder what (other) options I have with respect to either

  1. file compression (maximum compression rate / minimum file size; the time it takes to compress is of secondary importance) when saving the content to disk

  2. most efficiently storing the content (by whatever means, OS filesystem or DBMS)

So far, gzfile offers the best compression rate of the options I have tried in R. Can I do better? For example, I tried stripping unnecessary whitespace from the HTML code before saving, but gzfile seems to handle that already, as the resulting files are no smaller.
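One avenue worth checking: base R also ships bzip2 and xz codecs alongside gzip. A quick sketch using memCompress() (base R, no packages; the repetitive sample text below merely stands in for real HTML):

```r
## memCompress() exposes the same codecs as the file connections
## (gzfile/bzfile/xzfile), so it is a quick way to compare them in memory.
txt <- paste(rep("<tr><td><a href='pkg/index.html'>pkg</a></td></tr>", 2000),
             collapse = "\n")
raw_in <- charToRaw(txt)
sizes <- sapply(c("gzip", "bzip2", "xz"),
                function(type) length(memCompress(raw_in, type = type)))
sizes / length(raw_in)  # compression ratios; xz is usually smallest on text
```

On typical text-heavy input, xz tends to produce the smallest output, at the cost of compression time, which matches the stated priorities here.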

Extended curiosity question:

How do search engines handle this problem? Or are they throwing away the code as soon as it has been indexed and thus something like this is not relevant for them?


Illustration

Getting example HTML code:

url_current <- "http://cran.at.r-project.org/web/packages/available_packages_by_name.html"
html <- readLines(url(url_current))

Saving to disk:

path_txt        <- file.path(tempdir(), "test.txt")
path_gz         <- gsub("\\.txt$", ".gz", path_txt)
path_rdata      <- gsub("\\.txt$", ".rdata", path_txt)
path_rdata_2    <- gsub("\\.txt$", "_raw.rdata", path_txt)

write(html, file=path_txt)
write(html, file=gzfile(path_gz, "w"))
save(html, file=path_rdata)

html_raw <- charToRaw(paste(html, collapse="\n"))
save(html_raw, file=path_rdata_2)
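As a possible improvement on the plain save() call above: saveRDS() (and save()) accept a compress argument of "gzip", "bzip2", or "xz", and "xz" typically yields the smallest file for text-heavy objects. A sketch, with a sample vector standing in for the downloaded html:

```r
## saveRDS() with compress = "xz" trades compression time for file size.
html_sample <- rep("<td><a href='devtools/index.html'>devtools</a></td>", 2000)
path_rds_gz <- tempfile(fileext = ".rds")
path_rds_xz <- tempfile(fileext = ".rds")
saveRDS(html_sample, file = path_rds_gz, compress = "gzip")
saveRDS(html_sample, file = path_rds_xz, compress = "xz")
file.info(c(path_rds_gz, path_rds_xz))$size  # compare the two on-disk sizes
```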

Trying to remove unnecessary whitespace:

html_2  <- gsub("(>)\\s*(<)", "\\1\\2",html)
path_gz_2   <- gsub("\\.txt$", "_2.gz", path_txt)
write(html_2, gzfile(path_gz_2, "w"))
html_2  <- gsub("\\n", "", html_2)
path_gz_3   <- gsub("\\.txt$", "_3.gz", path_txt)
write(html_2, gzfile(path_gz_3, "w"))
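For comparison, xzfile() and bzfile() are drop-in replacements for the gzfile() connection used above. A sketch (the sample text stands in for the downloaded html vector):

```r
## Writing through an xzfile() connection works like gzfile(), but applies
## xz compression, which usually compresses text harder (and more slowly).
html_sample <- rep("<tr><td><a href='pkg/index.html'>pkg</a></td></tr>", 2000)
path_xz <- file.path(tempdir(), "test.xz")
con <- xzfile(path_xz, "w")
writeLines(html_sample, con)
close(con)
file.info(path_xz)$size
```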

Resulting file sizes:

files   <- list.files(dirname(path_txt), full.names=TRUE)
fsizes  <- file.info(files)$size
names(fsizes) <- sapply(files, basename)

> fsizes
       test.gz     test.rdata       test.txt      test_2.gz      test_3.gz 
        164529         183818         849647         164529         164529 
test_raw.rdata 
        164608 

Checking validity of processed HTML code:

require("XML")
html_parsed <- htmlParse(html)

> xpathSApply(html_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_2_parsed <- htmlParse(readLines(gzfile(path_gz_2)))

> xpathSApply(html_2_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_3_parsed <- htmlParse(readLines(gzfile(path_gz_3)))

> xpathSApply(html_3_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML
Answer:

html_2 <- gsub(">\\s*<", "", html)

This strips away the > and < characters along with the whitespace, leaving broken tags.

Instead, use capture groups so the brackets are put back:

html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
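A minimal check of the difference on a toy fragment (not the CRAN page):

```r
## Replacing the whole match with "" deletes the brackets themselves,
## while capture groups keep them and only drop the whitespace between tags.
x <- "<td> <a href='x'>pkg</a> </td>"
gsub(">\\s*<", "", x)            # "<tda href='x'>pkg</a/td>"      -- tags broken
gsub("(>)\\s*(<)", "\\1\\2", x)  # "<td><a href='x'>pkg</a></td>"  -- intact
```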
