
File compression for and storing of HTML content

For HTML content retrieved via R, I wonder what (other) options I have with respect to

  1. file compression when saving the content to disk (maximum compression rate / minimum file size; the time it takes to compress is of secondary importance)

  2. storing the content as efficiently as possible (by whatever means, OS filesystem or DBMS)

My current findings are that gzfile offers the best compression rate in R. Can I do better? For example, I tried getting rid of unnecessary whitespace in the HTML code before saving, but it seems gzfile already takes care of that, as I don't end up with smaller file sizes in comparison.

Extended curiosity question:

How do search engines handle this problem? Or do they throw away the code as soon as it has been indexed, so that something like this is not relevant for them?


Illustration

Getting example HTML code:

url_current <- "http://cran.at.r-project.org/web/packages/available_packages_by_name.html"
html <- readLines(url(url_current))

Saving to disk:

path_txt        <- file.path(tempdir(), "test.txt")
path_gz         <- gsub("\\.txt$", ".gz", path_txt)
path_rdata      <- gsub("\\.txt$", ".rdata", path_txt)
path_rdata_2    <- gsub("\\.txt$", "_raw.rdata", path_txt)

write(html, file=path_txt)
con <- gzfile(path_gz, "w")
write(html, file=con)
close(con)  # close the connection so the compressed data is flushed to disk
save(html, file=path_rdata)

html_raw <- charToRaw(paste(html, collapse="\n"))
save(html_raw, file=path_rdata_2)

Trying to remove unnecessary whitespace:

html_2  <- gsub("(>)\\s*(<)", "\\1\\2", html)
path_gz_2   <- gsub("\\.txt$", "_2.gz", path_txt)
con <- gzfile(path_gz_2, "w")
write(html_2, con)
close(con)
html_2  <- gsub("\\n", "", html_2)
path_gz_3   <- gsub("\\.txt$", "_3.gz", path_txt)
con <- gzfile(path_gz_3, "w")
write(html_2, con)
close(con)
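
A side note on the whitespace removal above: readLines returns one vector element per line, with the newline characters already stripped. The pattern is therefore applied to each line separately and cannot match whitespace that spans a line break, gsub("\\n", "", ...) has nothing to remove, and write() puts every element back on its own line. The following is a minimal sketch of that effect; html_lines is a made-up stand-in for the downloaded page, and collapsing into a single string first is only one possible way to let the pattern cross line boundaries.

```r
# Hypothetical stand-in for the result of readLines(): one element per line
html_lines <- c("<table>", "  <tr>", "    <td>a</td>   <td>b</td>", "  </tr>", "</table>")

# Applied per element, only whitespace between tags on the SAME line is removed
per_line <- gsub("(>)\\s*(<)", "\\1\\2", html_lines, perl = TRUE)

# Collapsing into one string first lets \\s* also match the line breaks
collapsed <- paste(html_lines, collapse = "\n")
minified  <- gsub("(>)\\s*(<)", "\\1\\2", collapsed, perl = TRUE)

print(per_line)
print(minified)
```

Note that removing whitespace between tags can change how a page renders (for example inside pre elements or between inline elements), so this is only safe when the content is kept for parsing rather than display.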

Resulting file sizes:

files   <- list.files(dirname(path_txt), full.names=TRUE)
fsizes  <- file.info(files)$size
names(fsizes) <- sapply(files, basename)

> fsizes
       test.gz     test.rdata       test.txt      test_2.gz      test_3.gz 
        164529         183818         849647         164529         164529 
test_raw.rdata 
        164608 
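
For comparison, base R also provides bzfile and xzfile connections next to gzfile, and all three accept a compression argument. The sketch below writes the same lines through each of them at compression level 9 and reports the resulting file sizes. The html_demo vector and the compressed_size helper are made up for illustration; the numbers depend entirely on the input, so this shows how to compare rather than which compressor wins on the CRAN page.

```r
# Synthetic, repetitive HTML-like lines (an assumption standing in for real content)
html_demo <- sprintf(
  "<tr><td><a href=\"../../web/packages/pkg%04d/index.html\">pkg%04d</a></td><td>Demo package %d</td></tr>",
  1:2000, 1:2000, 1:2000
)

# Hypothetical helper: write lines through a compressed connection, return the file size
compressed_size <- function(lines, opener) {
  path <- tempfile()
  on.exit(unlink(path), add = TRUE)
  con <- opener(path, "w", compression = 9)
  writeLines(lines, con)
  close(con)  # closing flushes the compressed stream to disk
  file.size(path)
}

sizes <- c(
  plain = sum(nchar(html_demo, type = "bytes") + 1L),  # bytes incl. one newline per line
  gz    = compressed_size(html_demo, gzfile),
  bz2   = compressed_size(html_demo, bzfile),
  xz    = compressed_size(html_demo, xzfile)
)
print(sizes)

# Round trip through xz to confirm nothing is lost
path_xz <- tempfile(fileext = ".xz")
con <- xzfile(path_xz, "w", compression = 9)
writeLines(html_demo, con)
close(con)
con <- xzfile(path_xz, "r")
restored <- readLines(con)
close(con)
unlink(path_xz)
```

The same choice is available for save() through its compress argument ("gzip", "bzip2" or "xz"). xz tends to be slower, which matters little here since compression time is of secondary importance.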

Checking validity of processed HTML code:

library(XML)
html_parsed <- htmlParse(html)

> xpathSApply(html_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_2_parsed <- htmlParse(readLines(gzfile(path_gz_2)))

> xpathSApply(html_2_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_3_parsed <- htmlParse(readLines(gzfile(path_gz_3)))

> xpathSApply(html_3_parsed, "//a[. = 'devtools']", xmlAttrs)
                                    href 
"../../web/packages/devtools/index.html" 
## >> Valid HTML

html_2 <- gsub(">\\s*<", "", html)

strips away the > and < as well, because the whole match is replaced by an empty string.

Instead, try:

html_2 <- gsub("(>)\\s*(<)", "\\1\\2", html)
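
The difference between the two patterns is easiest to see on a small string. The sketch below uses a made-up one-line input x. The lookaround variant at the end is an alternative that is not part of the original suggestion; it needs perl = TRUE.

```r
x <- "<td>a</td>   <td>b</td>"  # hypothetical input

# The whole match, including the brackets, is replaced by nothing
broken <- gsub(">\\s*<", "", x)

# Capture groups put the brackets back and drop only the whitespace
fixed <- gsub("(>)\\s*(<)", "\\1\\2", x)

# Alternative: lookarounds match only the whitespace itself
fixed_lookaround <- gsub("(?<=>)\\s+(?=<)", "", x, perl = TRUE)

print(broken)
print(fixed)
print(fixed_lookaround)
```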
