How to scrape a table from a website with rvest

I am trying to learn web scraping with R and rvest, but I can't get it to pull the data, so I'm clearly missing something.

I am trying to pull the table labelled Daily Observations from https://www.wunderground.com/history/monthly/KMCI/date/2014-8. But rvest doesn't seem to find table, tr, td, or any of the other standard tags I'm familiar with.

I tried to use RSelenium, but the first command just returns "PATH to JAVA not found. Please check JAVA is installed." So I am trying to use only rvest.

What am I missing here?

Here is the code I have so far if it helps:

library(rvest)

wind_site <- "https://www.wunderground.com/history/monthly/KMCI/date/2014-8"

HTML <- read_html(wind_site)

wind_table_html <- HTML %>% html_nodes("table") %>% html_table()
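A likely reason the code above comes back empty is that the page builds its tables in the browser with JavaScript, while read_html() only sees the HTML the server sends. The sketch below imitates that situation offline; the static_html string is a made-up stand-in, not the real Weather Underground markup.

```r
library(rvest)

# Hypothetical stand-in for the raw response of a JavaScript-rendered page:
# an empty container that a script would fill in later, inside a browser.
static_html <- "<html><body><div id='app'></div><script>/* builds the tables */</script></body></html>"

page <- read_html(static_html)
tables <- page %>% html_nodes("table") %>% html_table()

# No <table> elements exist until a browser runs the script,
# so the result is an empty list.
length(tables)
```

If length(tables) is 0 for the real URL as well, the data has to come from a rendered page (for example through a browser driver) rather than from the raw response.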

I was able to extract the content of the tables with the following code. You need to install Docker first (see https://docs.docker.com/engine/install/):

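library(RSelenium)
library(rvest)

url <- "https://www.wunderground.com/history/monthly/KMCI/date/2014-8"

# Start a Selenium server with Firefox in Docker (host port 4445 -> container port 4444).
# system() runs on any OS; shell() is only available in R on Windows.
system("docker run -d -p 4445:4444 selenium/standalone-firefox")
Sys.sleep(5)  # give the container time to start; may need to be longer

# Connect to the Selenium server and open a browser session
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()

# Load the page and wait for its JavaScript to render the tables
remDr$navigate(url)
Sys.sleep(5)

# Hand the rendered HTML to rvest and parse every table on the page
htmltxt <- remDr$getPageSource()[[1]]
read_html(htmltxt) %>% html_table()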

[[1]]
# A tibble: 14 x 6
   X1                      X2    X3      X4    X5      X6     
   <chr>                   <chr> <chr>   <chr> <chr>   <chr>  
 1 Temperature (°F)        Max   Average Min   Polygon NA     
 2 Max Temperature         97    86.32   74    NA      NA     
 3 Avg Temperature         85.46 76.47   63.5  NA      NA     
 4 Min Temperature         78    68.71   58    NA      NA     
 5 Dew Point (°F)          Max   Average Min   Polygon NA     
 6 Dew Point               76    66.82   55    NA      NA     
 7 Precipitation (in)      Max   Average Min   Sum     Polygon
 8 Precipitation           3.47  0.20    0.00  6.28    NA     
 9 Snowdepth               0.00  0.00    0.00  0.00    NA     
10 Wind (mph)              Max   Average Min   Polygon NA     
11 Wind                    30    8.5     0     NA      NA     
12 Gust Wind               49    1.32    0     NA      NA     
13 Sea Level Pressure (in) Max   Average Min   Polygon NA     
14 Sea Level Pressure      29.11 28.88   28.61 NA      NA     

[[2]]
# A tibble: 226 x 551
   X1     X2    X3    X4    X5    X6    X7       X8    X9   X10   X11   X12   X13   X14   X15   X16   X17   X18   X19   X20   X21   X22
   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 Time   Temp~ Dew ~ Humi~ Wind~ Pres~ Prec~    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 Aug  ~ Aug   1     2     3     4     5         6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
 3 Aug    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 1      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 2      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 3      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 4      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8 5      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9 6      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
10 7      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# ... with 216 more rows, and 529 more variables: X23 <int>, X24 <int>, X25 <int>, X26 <int>, X27 <int>, X28 <int>, X29 <int>,
#   X30 <int>, X31 <int>, X32 <int>, X33 <int>, X34 <chr>, X35 <chr>, X36 <chr>, X37 <chr>, X38 <int>, X39 <dbl>, X40 <int>,
#   X41 <int>, X42 <dbl>, X43 <int>, X44 <int>, X45 <dbl>, X46 <int>, X47 <int>, X48 <dbl>, X49 <int>, X50 <int>, X51 <dbl>,
#   X52 <int>, X53 <int>, X54 <dbl>, X55 <int>, X56 <int>, X57 <dbl>, X58 <int>, X59 <int>, X60 <dbl>, X61 <int>, X62 <int>,
#   X63 <dbl>, X64 <int>, X65 <int>, X66 <dbl>, X67 <int>, X68 <int>, X69 <dbl>, X70 <int>, X71 <int>, X72 <dbl>, X73 <int>,
#   X74 <int>, X75 <dbl>, X76 <int>, X77 <int>, X78 <dbl>, X79 <int>, X80 <int>, X81 <dbl>, X82 <int>, X83 <int>, X84 <dbl>,
#   X85 <int>, X86 <int>, X87 <dbl>, X88 <int>, X89 <int>, X90 <dbl>, X91 <int>, X92 <int>, X93 <dbl>, X94 <int>, X95 <int>, ...
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

[[3]]
# A tibble: 32 x 1
   X1   
   <chr>
 1 Aug  
 2 1    
 3 2    
 4 3    
 5 4    
 6 5    
 7 6    
 8 7    
 9 8    
10 9    
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[4]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 84    74.4  65   
 3 88    77.0  63   
 4 86    72.1  66   
 5 91    78.9  68   
 6 91    77.8  70   
 7 89    74.2  68   
 8 83    73.3  69   
 9 78    71.6  67   
10 83    74.2  66   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[5]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 63    61.3  58   
 3 64    61.2  59   
 4 68    63.9  60   
 5 71    66.6  62   
 6 73    70.1  68   
 7 72    68.5  65   
 8 71    68.2  67   
 9 68    66.8  65   
10 70    67.2  65   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[6]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 84    65.4  46   
 3 87    60.6  37   
 4 96    76.5  49   
 5 84    67.0  45   
 6 100   79.2  52   
 7 97    83.8  55   
 8 97    84.8  63   
 9 97    85.0  68   
10 96    79.7  60   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[7]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 10    6.0   0    
 3 10    4.9   0    
 4 20    10.0  6    
 5 17    10.1  0    
 6 13    5.7   0    
 7 21    10.9  3    
 8 13    8.0   3    
 9 10    6.5   0    
10 12    4.7   0    
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[8]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 29.0  29.0  28.9 
 3 29.1  29.0  29.0 
 4 29.1  29.0  28.9 
 5 29.0  28.9  28.9 
 6 29.0  28.9  28.9 
 7 28.9  28.9  28.8 
 8 28.8  28.8  28.8 
 9 28.9  28.9  28.8 
10 29.0  28.9  28.9 
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[9]]
# A tibble: 32 x 1
   X1   
   <chr>
 1 Total
 2 0.00 
 3 0.00 
 4 0.04 
 5 0.71 
 6 0.00 
 7 0.00 
 8 3.47 
 9 0.00 
10 0.00 
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows
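In the output above, the Daily Observations data is split across several small tables ([[3]] to [[9]]), each holding its header in the first row. A minimal sketch of stitching them back into one data frame follows. The helper promote_header() and the prefixes "temp" and "precip" are names made up for illustration, and the toy tables only copy the first few rows printed above.

```r
# Promote the first row of a sub-table to column names (with a prefix)
# and convert the remaining rows to numeric.
promote_header <- function(tbl, prefix) {
  tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
  header <- unlist(tbl[1, ], use.names = FALSE)
  body <- tbl[-1, , drop = FALSE]
  body[] <- lapply(body, as.numeric)
  names(body) <- paste(prefix, tolower(header), sep = "_")
  rownames(body) <- NULL
  body
}

# Toy stand-ins copied from the first rows of tables [[3]], [[4]] and [[9]]
day_tbl <- data.frame(X1 = c("Aug", "1", "2", "3"), stringsAsFactors = FALSE)
temp_tbl <- data.frame(
  X1 = c("Max", "84", "88", "86"),
  X2 = c("Avg", "74.4", "77.0", "72.1"),
  X3 = c("Min", "65", "63", "66"),
  stringsAsFactors = FALSE
)
precip_tbl <- data.frame(X1 = c("Total", "0.00", "0.00", "0.04"), stringsAsFactors = FALSE)

# Bind the day column and the converted sub-tables side by side
daily <- cbind(
  day = as.integer(day_tbl$X1[-1]),
  promote_header(temp_tbl, "temp"),
  promote_header(precip_tbl, "precip")
)
print(daily)
```

With the real result, the same helper could be applied to the elements of the list returned by html_table(), assuming the table positions stay the same from one page load to the next.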

