Web Scraping in R: Issues with parsing an HTML table

I have been trying to scrape a table from a website so that I can reformat it in R. I have done this before for other websites, but this one is proving particularly challenging. My code is below:

library(rvest)
library(httr) # provides GET() and user_agent()

URL <- "http://www.barttorvik.com/schedule.php"
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
page <- read_html(GET(URL, user_agent(uastring)))

tbls <- page %>%
  html_nodes("#tblData") %>% # name of the table on the website
  html_table(trim = FALSE) # This returns a data frame with the right columns but no data

If you run this, you will see that it returns an empty data frame (or a list containing an empty data frame). I have been looking through other threads and have yet to find a fix.

I appreciate your feedback!

This table seems to have an additional row that uses colspan="6":

page %>%
  html_nodes("td") %>% tail(1)

{xml_nodeset (1)}
[1] <td colspan="6" style="background: #5b9bd5; color: #fff; text-transform: uppercase;\r\ntext-align:center; font-size:10px">MOV Mean absolute error: 7.57 | Totals MAE: 29.8 | \r\nScore bias: -29.8 |<span cl ...

I think this could somehow be solved with the unpivotr package.

Otherwise you can try:

library(rvest)
library(httr) # provides GET() and user_agent()

URL <- "http://www.barttorvik.com/schedule.php"
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
page <- read_html(GET(URL, user_agent(uastring)))

cols<- page %>%
  html_nodes("th") %>%
  html_text()

data <- page %>%
  html_nodes("td") %>%
  html_text()


finaldata <- data.frame(matrix(data[-126], ncol=5, byrow=TRUE)) # element 126 is the colspan cell, so leave it out
names(finaldata) <- cols

finaldata

  Time (CT)                                              Matchup                            T-Rank Line TTQ      Result
1    6:00 PM                 45 North Carolina St. at 52 Virginia             Virginia -2.5, 58-55 (62%)  65            
2    6:00 PM                         55 Texas at 10 West Virginia        West Virginia -9.4, 67-58 (85%)  60            
3    8:00 PM                              58 Oklahoma at 5 Baylor              Baylor -12.8, 75-63 (89%)  59            
4    3:00 PM                    108 Charlotte at 186 Old Dominion            Charlotte -0.5, 57-56 (53%)  59            
5    6:00 PM                          146 Winthrop at 161 Radford              Radford -3.1, 74-71 (62%)  54            
6    7:30 PM              233 Texas Southern at 289 Grambling St.        Grambling St. -0.0, 76-75 (50%)  54            
7    3:00 PM                       241 LIU Brooklyn at 313 Wagner         LIU Brooklyn -0.9, 78-77 (53%)  49            
8    4:00 PM                215 Sacred Heart at 228 Robert Morris        Robert Morris -2.9, 74-71 (61%)  47            
9    6:30 PM             285 North Carolina A&T at 311 Morgan St.           Morgan St. -2.4, 72-69 (60%)  41            
10   6:00 PM                    259 UNC Asheville at 334 Longwood        UNC Asheville -2.8, 75-72 (61%)  40            
11   7:30 PM              179 Prairie View A&M at 322 Jackson St.     Prairie View A&M -5.5, 70-64 (72%)  39            
12   3:00 PM        301 North Carolina Central at 288 Florida A&M          Florida A&M -3.7, 65-61 (66%)  35            
13   6:00 PM                     234 Campbell at 331 Presbyterian             Campbell -3.1, 65-62 (64%)  34            
14   6:00 PM                          263 Bucknell at 156 Colgate             Colgate -10.6, 77-66 (85%)  34            
15   7:00 PM                           251 Rice at 60 North Texas         North Texas -16.2, 77-60 (94%)  32            
16   3:00 PM                  271 Merrimack at 188 St. Francis PA       St. Francis PA -8.2, 68-60 (82%)  30            
17   7:30 PM            315 Alcorn St. at 347 Arkansas Pine Bluff           Alcorn St. -3.2, 66-62 (64%)  28            
18   6:00 PM           317 St. Francis NY at 275 Mount St. Mary's     Mount St. Mary's -5.4, 67-61 (73%)  28            
19   9:05 PM                    310 Weber St. at 184 Portland St.        Portland St. -12.2, 78-66 (87%)  26            
20   6:00 PM                      327 Hampton at 248 Gardner Webb        Gardner Webb -10.2, 81-71 (82%)  24            
21  11:00 AM                                64 Yale at 348 Howard                Yale -21.6, 78-56 (98%)  19 Yale, 89-75
22   4:00 PM           278 Southern at 351 Mississippi Valley St.            Southern -10.5, 80-69 (84%)  15            
23   5:00 PM                    344 High Point at 306 USC Upstate         USC Upstate -10.0, 74-64 (84%)  15            
24   2:30 PM   350 Central Connecticut at 320 Fairleigh Dickinson Fairleigh Dickinson -14.6, 80-66 (91%)   5            
25   6:30 PM 352 Maryland Eastern Shore at 323 South Carolina St.  South Carolina St. -15.8, 75-60 (94%)   0  
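
The hard-coded index 126 only holds while the schedule has exactly 25 games. A minimal sketch of a less brittle variant is below: it skips any cell that carries a colspan attribute by using the CSS selector td:not([colspan]). The inline HTML is a hypothetical miniature of the page (three columns, two games), not the real markup, and the name schedule is made up for illustration.

```r
library(rvest)

# Hypothetical miniature of the page: body rows lack an opening <tr>,
# and the last cell is a summary that spans all columns.
sample_html <- paste0(
  "<table id='tblData'>",
  "<thead><tr><th>Time</th><th>Matchup</th><th>TTQ</th></tr></thead>",
  "<tbody>",
  "<td>6:00 PM</td><td>Texas at West Virginia</td><td>60</td></tr>",
  "<td>8:00 PM</td><td>Oklahoma at Baylor</td><td>59</td></tr>",
  "<td colspan='3'>summary line</td></tr>",
  "</tbody></table>"
)
page <- read_html(sample_html)

cols <- page %>% html_nodes("th") %>% html_text()

# Select only the cells without a colspan attribute, so no index is needed.
cells <- page %>% html_nodes("td:not([colspan])") %>% html_text()

# Guard against a ragged table before reshaping.
stopifnot(length(cells) %% length(cols) == 0)

schedule <- as.data.frame(
  matrix(cells, ncol = length(cols), byrow = TRUE),
  stringsAsFactors = FALSE
)
names(schedule) <- cols
schedule
```

On the real page the same selector would replace html_nodes("td"), assuming the summary row is the only cell with a colspan attribute.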

For whatever reason, your page ( http://www.barttorvik.com/schedule.php ) lacks opening <tr> elements in the table body. They are present in the table header (the <thead> part) but not in <tbody>. You can easily verify this by opening the page source (Ctrl+U) and searching for tr>: you will find closing </tr> elements in the table body, but no opening <tr>.
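
The same check can be done from R by counting opening and closing row tags in the raw source. The sketch below uses a hypothetical inline string in place of the downloaded page; count_matches is a helper defined here, not part of any package.

```r
# Count regex matches in a single string; returns 0 when nothing matches.
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# Hypothetical miniature of the page source: the header row is well formed,
# the two body rows have </tr> but no <tr>.
raw_html <- paste0(
  "<table id='tblData'>",
  "<thead><tr><th>Time</th><th>Matchup</th></tr></thead>",
  "<tbody>",
  "<td>6:00 PM</td><td>Texas at West Virginia</td></tr>",
  "<td>8:00 PM</td><td>Oklahoma at Baylor</td></tr>",
  "</tbody></table>"
)

n_open  <- count_matches("<tr[ >]", raw_html)  # opening <tr> or <tr ...>
n_close <- count_matches("</tr>", raw_html)    # closing </tr>

cat("opening <tr>:", n_open, " closing </tr>:", n_close, "\n")
```

A mismatch between the two counts is a quick sign that the markup is unbalanced. For the live page, the raw text could be obtained with something like httr::content(GET(URL, user_agent(uastring)), as = "text") before counting.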

If you correct this by adding the missing elements, everything works. Browsers are programmed to cope with bad HTML (they render it anyway), but rvest does not know what to do with it, so it returns only the header row. You could write code to insert <tr> before every <td style='text-align:left;white-space:nowrap' id="mobileout">, or maybe you could ask the page authors to correct their code generator?
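
A minimal sketch of the first option follows: patch the raw HTML with a regular expression, then parse the repaired text. It assumes, as described above, that the first cell of every body row is the <td> carrying id="mobileout". The inline HTML is a hypothetical two-row miniature rather than the real page.

```r
library(rvest)

# Hypothetical miniature of the broken markup: each body row starts with a
# <td ... id="mobileout"> cell and ends with </tr>, but has no opening <tr>.
broken_html <- paste0(
  "<table id='tblData'>",
  "<thead><tr><th>Time</th><th>Matchup</th></tr></thead>",
  "<tbody>",
  "<td style='text-align:left;white-space:nowrap' id=\"mobileout\">6:00 PM</td>",
  "<td>Texas at West Virginia</td></tr>",
  "<td style='text-align:left;white-space:nowrap' id=\"mobileout\">8:00 PM</td>",
  "<td>Oklahoma at Baylor</td></tr>",
  "</tbody></table>"
)

# Insert the missing <tr> in front of the first cell of each row.
fixed_html <- gsub("<td([^>]*id=\"mobileout\")", "<tr><td\\1", broken_html)

# Parse the repaired markup; html_table() now sees complete rows.
fixed_tbl <- read_html(fixed_html) %>%
  html_nodes("#tblData") %>%
  html_table()
fixed_tbl <- fixed_tbl[[1]]
fixed_tbl
```

For the live page, broken_html would be the raw response text, for example from httr::content(GET(URL, user_agent(uastring)), as = "text"). The regular expression depends on the current attribute layout of the page and would need adjusting if that changes.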

> mypage <- read_html("~/tmp/test.html")
# test.html is a local copy of the page in which I added the missing <tr> tags
> mypage %>% html_nodes('table') %>% html_table()
[[1]]
   Time (CT)                                              Matchup                            T-Rank Line TTQ Result
1    6:00 PM                 45 North Carolina St. at 51 Virginia             Virginia -2.5, 58-55 (62%)  65     NA
2    6:00 PM                         55 Texas at 10 West Virginia        West Virginia -9.4, 67-58 (85%)  60     NA
3    3:00 PM                    109 Charlotte at 186 Old Dominion            Charlotte -0.5, 57-56 (53%)  59     NA
4    6:00 PM                          145 Winthrop at 161 Radford              Radford -3.0, 74-71 (62%)  54     NA
5    7:30 PM              233 Texas Southern at 288 Grambling St.        Grambling St. -0.0, 76-75 (50%)  54     NA
6    3:00 PM                       240 LIU Brooklyn at 312 Wagner         LIU Brooklyn -0.9, 78-77 (54%)  49     NA
7    4:00 PM                215 Sacred Heart at 230 Robert Morris        Robert Morris -2.9, 74-71 (61%)  47     NA
8    6:30 PM             284 North Carolina A&T at 313 Morgan St.           Morgan St. -2.3, 72-69 (59%)  41     NA
9    6:00 PM                    259 UNC Asheville at 334 Longwood        UNC Asheville -2.8, 75-72 (61%)  40     NA
10   7:30 PM              179 Prairie View A&M at 323 Jackson St.     Prairie View A&M -5.5, 70-64 (72%)  39     NA
11   3:00 PM        302 North Carolina Central at 289 Florida A&M          Florida A&M -3.7, 65-61 (66%)  35     NA
12   6:00 PM                     234 Campbell at 331 Presbyterian             Campbell -3.1, 65-62 (64%)  34     NA
13   6:00 PM                          263 Bucknell at 156 Colgate             Colgate -10.5, 77-66 (84%)  34     NA
14   7:00 PM                           251 Rice at 61 North Texas         North Texas -16.2, 77-60 (94%)  32     NA
15   3:00 PM                  271 Merrimack at 188 St. Francis PA       St. Francis PA -8.2, 68-60 (82%)  30     NA
16   7:30 PM            315 Alcorn St. at 347 Arkansas Pine Bluff           Alcorn St. -3.2, 66-62 (64%)  28     NA
17   6:00 PM           317 St. Francis NY at 278 Mount St. Mary's     Mount St. Mary's -5.4, 67-61 (72%)  28     NA
18   9:05 PM                    310 Weber St. at 184 Portland St.        Portland St. -12.1, 78-66 (87%)  26     NA
19   6:00 PM                      327 Hampton at 249 Gardner Webb        Gardner Webb -10.3, 82-71 (82%)  24     NA
20  11:00 AM                                60 Yale at 348 Howard                Yale -21.6, 78-56 (98%)  18     NA
21   4:00 PM           277 Southern at 351 Mississippi Valley St.            Southern -10.5, 80-69 (83%)  15     NA
22   5:00 PM                    344 High Point at 306 USC Upstate         USC Upstate -10.0, 74-64 (84%)  15     NA
23   2:30 PM   350 Central Connecticut at 320 Fairleigh Dickinson Fairleigh Dickinson -14.6, 80-66 (91%)   5     NA
24   6:30 PM 352 Maryland Eastern Shore at 324 South Carolina St.  South Carolina St. -15.8, 76-60 (94%)   0     NA
