
Find html table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html. I think XML::readHTMLTable() is the right way to go, but when I try the following I get an empty list and a warning:

library(XML)

url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = TRUE, stringsAsFactors = FALSE)

named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'

This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in the browser's "Inspect" view for quite a while, but I'm not connecting the dots on how to be more precise. The table doesn't seem to have a name or class analogous to the examples I've found in the documentation or on SO. Thoughts?

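The empty list comes from the download step rather than from table selection: the XML package's parser does not fetch https:// URLs itself, so readHTMLTable() never sees the page. Consider using readLines() to download the HTML content first, then pass the result to readHTMLTable():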

library(XML)

url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)   # base R downloads the page over HTTPS

readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)        # LIST OF 3 TABLES

# $`NULL`
#                    Name FIPS State Numeric Code Official USPS Code
# 1               Alabama                      01                 AL
# 2                Alaska                      02                 AK
# 3               Arizona                      04                 AZ
# 4              Arkansas                      05                 AR
# 5            California                      06                 CA
# 6              Colorado                      08                 CO
# 7           Connecticut                      09                 CT
# 8              Delaware                      10                 DE
# 9  District of Columbia                      11                 DC
# 10              Florida                      12                 FL
# 11              Georgia                      13                 GA
# 12               Hawaii                      15                 HI
# 13                Idaho                      16                 ID
# 14             Illinois                      17                 IL
# ...

To return only the first table as a data frame:

fipsdf <- readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)[[1]]
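Picking the table by position works, but it breaks if the page's table order changes. Since the tables on this page come back unnamed, one alternative is to pick the table by a column header that only it contains. The following is a minimal sketch, not a tested solution for the live page: the inline HTML is a made-up stand-in for the downloaded content, and the assumption that only the wanted table has an "Official USPS Code" column would need checking against the real page.

```r
library(XML)

# Made-up stand-in for the downloaded page (assumption: the real markup differs).
html <- paste0(
  "<html><body>",
  "<table><tr><th>Area Name</th><th>FIPS State Numeric Code</th></tr>",
  "<tr><td>Guam</td><td>66</td></tr></table>",
  "<table><tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>",
  "<tr><td>Alabama</td><td>01</td><td>AL</td></tr>",
  "<tr><td>Alaska</td><td>02</td><td>AK</td></tr></table>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Flag the tables that have the column we are looking for.
has_usps <- vapply(
  tables,
  function(tb) "Official USPS Code" %in% names(tb),
  logical(1)
)

# Take the first match; this fails with an error if no table matches.
fipsdf <- tables[[which(has_usps)[1]]]
```

With real data, replace `html` with the `webpage` lines read earlier and confirm the column name by printing `lapply(tables, names)` first.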

Another solution, using rvest instead of XML:

library(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>%   # list with one data frame per <table> on the page
  .[[1]]             # keep the first table
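With rvest, a table can also be selected by the text that labels it instead of by position, using an XPath expression. The sketch below assumes rvest 1.0 or later and uses made-up inline HTML in which the title sits in a `<caption>` element. That placement is an assumption; on the real page the title may be in a heading or another element, and the XPath would need adjusting. It also passes `convert = FALSE`, because the default type conversion would turn a code such as "01" into the number 1.

```r
library(rvest)

# Made-up stand-in for the page (assumption: titles are in <caption> elements).
html <- paste0(
  "<html><body>",
  "<table><caption>FIPS Codes for the States and the District of Columbia</caption>",
  "<tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>",
  "<tr><td>Alabama</td><td>01</td><td>AL</td></tr>",
  "<tr><td>Alaska</td><td>02</td><td>AK</td></tr></table>",
  "<table><caption>FIPS Codes for the Outlying Areas</caption>",
  "<tr><th>Area Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>",
  "<tr><td>Guam</td><td>66</td><td>GU</td></tr></table>",
  "</body></html>"
)

page <- read_html(html)

# Select the one table whose caption mentions the District of Columbia.
states_node <- html_element(
  page,
  xpath = "//table[caption[contains(., 'District of Columbia')]]"
)

# convert = FALSE keeps every column as character, preserving leading zeros.
fips <- html_table(states_node, convert = FALSE)
```

For the live page, pass the URL to `read_html()` in place of `html` and check in the browser's inspector which element actually holds the table title.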

