
File compression and storage of HTML content

For HTML content retrieved via R, I wonder what (other) options I have with respect to either

  1. file compression (maximum compression rate / minimum file size; the time it takes to compress is of secondary importance) when saving the content to disk

  2. most efficiently storing the content (by whatever means, OS filesystem or DBMS)

So far, gzfile offers the best compression rate of the options I have tried in R. Can I do better? For example, I tried stripping unnecessary whitespace from the HTML code before saving, but gzfile seems to handle that already, as the resulting files are no smaller.
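One avenue worth checking: base R also ships bzip2 and xz codecs alongside gzip. A quick sketch using memCompress() (base R, no packages; the repetitive sample text below merely stands in for real HTML):

```r
## memCompress() exposes the same codecs as the file connections
## (gzfile/bzfile/xzfile), so it is a quick way to compare them in memory.
txt <- paste(rep("<tr><td><a href='pkg/index.html'>pkg</a></td></tr>", 2000),
             collapse = "\n")
raw_in <- charToRaw(txt)
sizes <- sapply(c("gzip", "bzip2", "xz"),
                function(type) length(memCompress(raw_in, type = type)))
sizes / length(raw_in)  # compression ratios; xz is usually smallest on text
```

On typical text-heavy input, xz tends to produce the smallest output, at the cost of compression time, which matches the stated priorities here.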

Extended curiosity question:

How do search engines handle this problem? Or are they throwing away the code as soon as it has been indexed and thus something like this is not relevant for them?


Illustration

Getting example HTML code:

url_current <- "http://cran.at.r-project.org/web/packages/available_packages_by_name.html"
html <- readLines(url(url_current))

Saving to disk:

path_txt        <- file.path(tempdir(), "test.txt")
path_gz         <- gsub("\\.txt$", ".gz", path_txt)
path_rdata      <- gsub("\\.txt$", ".rdata", path_txt)
path_rdata_2    <- gsub("\\.txt$", "_raw.rdata", path_txt)

write(html, file=path_txt)
write(html, file=gzfile(path_gz, "w"))
save(html, file=path_rdata)

html_raw <- charToRaw(paste(html, collapse="\n"))
save(html_raw, file=path_rdata_2)
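As a possible improvement on the plain save() call above: saveRDS() (and save()) accept a compress argument of "gzip", "bzip2", or "xz", and "xz" typically yields the smallest file for text-heavy objects. A sketch, with a sample vector standing in for the downloaded html:

```r
## saveRDS() with compress = "xz" trades compression time for file size.
html_sample <- rep("<td><a href='devtools/index.html'>devtools</a></td>", 2000)
path_rds_gz <- tempfile(fileext = ".rds")
path_rds_xz <- tempfile(fileext = ".rds")
saveRDS(html_sample, file = path_rds_gz, compress = "gzip")
saveRDS(html_sample, file = path_rds_xz, compress = "xz")
file.info(c(path_rds_gz, path_rds_xz))$size  # compare the two on-disk sizes
```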

Trying to remove unnecessary whitespace:

html_2  <- gsub("(>)\\s*(<)", "\\1\\2",html)
path_gz_2   <- gsub("\\.txt$", "_2.gz", path_txt)
write(html_2, gzfile(path_gz_2, "w"))
html_2  <- gsub("\\n", "", html_2)
path_gz_3   <- gsub("\\.txt$", "_3.gz", path_txt)
write(html_2, gzfile(path_gz_3, "w"))
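For comparison, xzfile() and bzfile() are drop-in replacements for the gzfile() connection used above. A sketch (the sample text stands in for the downloaded html vector):

```r
## Writing through an xzfile() connection works like gzfile(), but applies
## xz compression, which usually compresses text harder (and more slowly).
html_sample <- rep("<tr><td><a href='pkg/index.html'>pkg</a></td></tr>", 2000)
path_xz <- file.path(tempdir(), "test.xz")
con <- xzfile(path_xz, "w")
writeLines(html_sample, con)
close(con)
file.info(path_xz)$size
```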

Resulting file sizes:

files   <- list.files(dirname(path_txt), full.names=TRUE)
fsizes  <- file.info(files)$size
names(fsizes) <- sapply(files, basename)

> fsizes
       test.gz     test.rdata       test.txt      test_2.gz      test_3.gz 
        164529         183818         849647         164529         164529 
test_raw.rdata 
        164608 

Checking validity of processed HTML code:

require("XML")
html_parsed <- htmlParse(html)

> xpathSApply(html_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_2_parsed <- htmlParse(readLines(gzfile(path_gz_2)))

> xpathSApply(html_2_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_3_parsed <- htmlParse(readLines(gzfile(path_gz_3)))

> xpathSApply(html_3_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML
Answer:

html_2 <- gsub(">\\s*<", "", html)

This strips away the > and < characters along with the whitespace, leaving broken tags.

Instead, use capture groups so the brackets are put back:

html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
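A minimal check of the difference on a toy fragment (not the CRAN page):

```r
## Replacing the whole match with "" deletes the brackets themselves,
## while capture groups keep them and only drop the whitespace between tags.
x <- "<td> <a href='x'>pkg</a> </td>"
gsub(">\\s*<", "", x)            # "<tda href='x'>pkg</a/td>"      -- tags broken
gsub("(>)\\s*(<)", "\\1\\2", x)  # "<td><a href='x'>pkg</a></td>"  -- intact
```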
