Reading multiple offline HTML files to a list in R

I have raw data as 20 offline HTML files stored in the following format:

../rawdata/1999_table.html
../rawdata/2000_table.html
../rawdata/2001_table.html
../rawdata/2002_table.html
.
.
../rawdata/2017_table.html

These files contain tables that I am extracting and reshaping to a particular format.

I want to read these files into a list in one go and process them one by one with a function that I have written.

What I tried: I put the names of these files into an Excel file called filestoread.xlsx and used a for loop to load the files using the names listed in the sheet. But it doesn't seem to work:

filestoread <- fread("../rawdata/filestoread.csv")

x <- list()
for (i in nrow(filestoread)) {
  x[[i]] <- read_html(paste0("../rawdata/", filestoread[i]))
}

How can this be done?

Also, after reading the HTML files, I want to extract the tables from them, convert them to data tables, and reshape them using a function I wrote.

My final objective is to rbind all the tables into a single data table with year-wise entries from the tables in the HTML files.

First, store the paths to your files in one of the following ways.

Either hardcoded:

filestoread <- paste0("../rawdata/", 1999:2017, "_table.html")

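or by listing all HTML files in the directory (full.names = TRUE keeps the directory prefix, so the paths can be passed straight to read_html()):

filestoread <- list.files(path = "../rawdata/", pattern = "\\.html$", full.names = TRUE)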
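One detail worth checking here: by default list.files() returns bare file names without the directory, which only work if the working directory happens to be the data folder. The sketch below illustrates the difference using a temporary directory and made-up file names standing in for ../rawdata/ (both are assumptions for the sake of a self-contained example).

```r
# Hypothetical stand-in for ../rawdata/: a temporary directory with empty files
tmp <- tempfile("rawdata")
dir.create(tmp)
file.create(file.path(tmp, c("1999_table.html", "2000_table.html", "notes.txt")))

# Default: file names only, relative to 'path'
bare <- list.files(path = tmp, pattern = "\\.html$")

# full.names = TRUE: paths that include the directory
full <- list.files(path = tmp, pattern = "\\.html$", full.names = TRUE)

print(bare)
print(file.exists(full))
```

The pattern "\\.html$" also filters out other files in the folder, such as the notes.txt placeholder above.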

Then use lapply():

library(rvest)
# Read every file into a list; try() keeps lapply() going if one file fails
pages <- lapply(filestoread, function(x) try(read_html(x)))

Note: try() lets the remaining files be read even when one file is missing. Instead of stopping execution, the error is returned as an object of class "try-error" in the corresponding list element.
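Since failed reads stay in the list as "try-error" objects, it can help to separate them out before further processing. This is a minimal sketch, assuming a temporary directory with one real file and one deliberately missing file; the file names and contents are made up for illustration.

```r
library(rvest)

# Hypothetical stand-in data: one real file plus one path that does not exist
tmp <- tempfile("rawdata")
dir.create(tmp)
good <- file.path(tmp, "1999_table.html")
writeLines("<html><body><table><tr><td>1</td></tr></table></body></html>", good)
filestoread <- c(good, file.path(tmp, "2000_table.html"))  # second file is missing

# silent = TRUE suppresses the printed error message; the error is still returned
pages <- lapply(filestoread, function(x) try(read_html(x), silent = TRUE))
names(pages) <- basename(filestoread)

# Flag the list elements where reading failed
failed <- vapply(pages, inherits, logical(1), what = "try-error")
print(names(pages)[failed])  # files that could not be read

# Keep only the documents that were read successfully
pages <- pages[!failed]
```

Naming the list elements after the files makes it easier to see which year each parsed document belongs to.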

The second part of your question is a little broad and depends on the content of your files. There are already some answers on this; you could consider e.g. this answer. In principle, you use a combination of ?html_nodes and ?html_table.
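As a rough illustration of that combination, the sketch below reads each file, takes the first table, tags it with the year parsed from the file name, and binds everything into one data.table. The file contents, the column names item and value, and the helper read_one() are all hypothetical; real files may contain several tables or need a more specific selector, and the custom reshaping function from the question would be applied inside read_one().

```r
library(rvest)
library(data.table)

# Hypothetical stand-in data: two tiny files written to a temporary directory
tmp <- tempfile("rawdata")
dir.create(tmp)
for (yr in 1999:2000) {
  html <- sprintf(
    "<html><body><table><tr><th>item</th><th>value</th></tr><tr><td>a</td><td>%d</td></tr></table></body></html>",
    yr
  )
  writeLines(html, file.path(tmp, paste0(yr, "_table.html")))
}

filestoread <- list.files(tmp, pattern = "\\.html$", full.names = TRUE)

# Hypothetical helper: read one file and return its first table with a year column
read_one <- function(path) {
  page <- read_html(path)
  tbl  <- html_table(html_node(page, "table"))  # first <table> in the document
  dt   <- as.data.table(tbl)
  # A custom reshaping step would go here
  dt[, year := as.integer(sub("_table\\.html$", "", basename(path)))]
  dt
}

# fill = TRUE tolerates tables whose columns differ between years
result <- rbindlist(lapply(filestoread, read_one), fill = TRUE)
print(result)
```

Taking the year from the file name assumes the naming scheme shown in the question (for example 1999_table.html).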
