
How to scrape a table from a website with rvest

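I'm trying to learn web scraping with R and am trying to pull some information from a page, but I can't get it and I'm obviously missing something.

I'm trying to pull the table that shows the daily observations from https://www.wunderground.com/history/monthly/KMCI/date/2014-8. But I can't seem to get at it with the table, tr, td, or other standard tags I'm familiar with.

I tried using RSelenium, but when I run the first command I just get "PATH to JAVA not found. Please check JAVA is installed." So I'm trying to use rvest alone.

What am I missing here?

If it helps, here is my code so far: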

library(rvest)

wind_site <- "https://www.wunderground.com/history/monthly/KMCI/date/2014-8"

HTML <- read_html(wind_site)

wind_table_html <- HTML %>% html_nodes("table") %>% html_table()
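
One way to check whether the table is present in the HTML that read_html() actually receives is to count the table nodes before calling html_table(). The sketch below uses a hypothetical inline document (static_html) as a stand-in for a page whose tables are built by JavaScript in the browser. That the Weather Underground page works this way is an assumption, not something confirmed here.

```r
library(rvest)

# Hypothetical stand-in for the markup a server sends for a JavaScript-rendered
# page: a placeholder element and a script, but no <table> element yet.
static_html <- '<html><body>
  <div id="app"></div>
  <script>/* a browser would build the tables here */</script>
</body></html>'

page <- read_html(static_html)

# Count the <table> nodes in the static markup.
tables <- html_nodes(page, "table")
n_tables <- length(tables)
print(n_tables)
```

If the count is zero for the real page as well, the selector is not the problem: the tables do not exist until a browser runs the page's scripts.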

I was able to extract the contents of the tables with the following code (you need Docker installed; see https://docs.docker.com/engine/install/):

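library(RSelenium)
library(rvest)

url <- "https://www.wunderground.com/history/monthly/KMCI/date/2014-8"

# Start a Selenium Firefox container; host port 4445 maps to container port 4444.
# shell() exists only on Windows; on Linux/macOS use system() instead.
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')

# Connect to the Selenium server running in the container.
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(url)
Sys.sleep(5)  # wait for the page to finish loading
htmltxt <- remDr$getPageSource()[[1]]
Sys.sleep(5)

# Parse the page source from the browser and convert every <table> to a tibble.
read_html(htmltxt) %>% html_table()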

[[1]]
# A tibble: 14 x 6
   X1                      X2    X3      X4    X5      X6     
   <chr>                   <chr> <chr>   <chr> <chr>   <chr>  
 1 Temperature (°F)        Max   Average Min   Polygon NA     
 2 Max Temperature         97    86.32   74    NA      NA     
 3 Avg Temperature         85.46 76.47   63.5  NA      NA     
 4 Min Temperature         78    68.71   58    NA      NA     
 5 Dew Point (°F)          Max   Average Min   Polygon NA     
 6 Dew Point               76    66.82   55    NA      NA     
 7 Precipitation (in)      Max   Average Min   Sum     Polygon
 8 Precipitation           3.47  0.20    0.00  6.28    NA     
 9 Snowdepth               0.00  0.00    0.00  0.00    NA     
10 Wind (mph)              Max   Average Min   Polygon NA     
11 Wind                    30    8.5     0     NA      NA     
12 Gust Wind               49    1.32    0     NA      NA     
13 Sea Level Pressure (in) Max   Average Min   Polygon NA     
14 Sea Level Pressure      29.11 28.88   28.61 NA      NA     

[[2]]
# A tibble: 226 x 551
   X1     X2    X3    X4    X5    X6    X7       X8    X9   X10   X11   X12   X13   X14   X15   X16   X17   X18   X19   X20   X21   X22
   <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
 1 Time   Temp~ Dew ~ Humi~ Wind~ Pres~ Prec~    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 2 Aug  ~ Aug   1     2     3     4     5         6     7     8     9    10    11    12    13    14    15    16    17    18    19    20
 3 Aug    NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 4 1      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 5 2      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 6 3      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 7 4      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 8 5      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
 9 6      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
10 7      NA    NA    NA    NA    NA    NA       NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
# ... with 216 more rows, and 529 more variables: X23 <int>, X24 <int>, X25 <int>, X26 <int>, X27 <int>, X28 <int>, X29 <int>,
#   X30 <int>, X31 <int>, X32 <int>, X33 <int>, X34 <chr>, X35 <chr>, X36 <chr>, X37 <chr>, X38 <int>, X39 <dbl>, X40 <int>,
#   X41 <int>, X42 <dbl>, X43 <int>, X44 <int>, X45 <dbl>, X46 <int>, X47 <int>, X48 <dbl>, X49 <int>, X50 <int>, X51 <dbl>,
#   X52 <int>, X53 <int>, X54 <dbl>, X55 <int>, X56 <int>, X57 <dbl>, X58 <int>, X59 <int>, X60 <dbl>, X61 <int>, X62 <int>,
#   X63 <dbl>, X64 <int>, X65 <int>, X66 <dbl>, X67 <int>, X68 <int>, X69 <dbl>, X70 <int>, X71 <int>, X72 <dbl>, X73 <int>,
#   X74 <int>, X75 <dbl>, X76 <int>, X77 <int>, X78 <dbl>, X79 <int>, X80 <int>, X81 <dbl>, X82 <int>, X83 <int>, X84 <dbl>,
#   X85 <int>, X86 <int>, X87 <dbl>, X88 <int>, X89 <int>, X90 <dbl>, X91 <int>, X92 <int>, X93 <dbl>, X94 <int>, X95 <int>, ...
# i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

[[3]]
# A tibble: 32 x 1
   X1   
   <chr>
 1 Aug  
 2 1    
 3 2    
 4 3    
 5 4    
 6 5    
 7 6    
 8 7    
 9 8    
10 9    
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[4]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 84    74.4  65   
 3 88    77.0  63   
 4 86    72.1  66   
 5 91    78.9  68   
 6 91    77.8  70   
 7 89    74.2  68   
 8 83    73.3  69   
 9 78    71.6  67   
10 83    74.2  66   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[5]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 63    61.3  58   
 3 64    61.2  59   
 4 68    63.9  60   
 5 71    66.6  62   
 6 73    70.1  68   
 7 72    68.5  65   
 8 71    68.2  67   
 9 68    66.8  65   
10 70    67.2  65   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[6]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 84    65.4  46   
 3 87    60.6  37   
 4 96    76.5  49   
 5 84    67.0  45   
 6 100   79.2  52   
 7 97    83.8  55   
 8 97    84.8  63   
 9 97    85.0  68   
10 96    79.7  60   
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[7]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 10    6.0   0    
 3 10    4.9   0    
 4 20    10.0  6    
 5 17    10.1  0    
 6 13    5.7   0    
 7 21    10.9  3    
 8 13    8.0   3    
 9 10    6.5   0    
10 12    4.7   0    
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[8]]
# A tibble: 32 x 3
   X1    X2    X3   
   <chr> <chr> <chr>
 1 Max   Avg   Min  
 2 29.0  29.0  28.9 
 3 29.1  29.0  29.0 
 4 29.1  29.0  28.9 
 5 29.0  28.9  28.9 
 6 29.0  28.9  28.9 
 7 28.9  28.9  28.8 
 8 28.8  28.8  28.8 
 9 28.9  28.9  28.8 
10 29.0  28.9  28.9 
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows

[[9]]
# A tibble: 32 x 1
   X1   
   <chr>
 1 Total
 2 0.00 
 3 0.00 
 4 0.04 
 5 0.71 
 6 0.00 
 7 0.00 
 8 3.47 
 9 0.00 
10 0.00 
# ... with 22 more rows
# i Use `print(n = ...)` to see more rows
