Name is not XML Namespace compliant

I'm trying to read the table on this site:

http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16

I'm using rvest, but quickly get an error:

library(rvest)
read_html("http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16")

Error: Name spoiler:3tbt4d3m is not XML Namespace compliant [202]

What does this error mean, and is there anything I can do to get around it?

I've gotten as far as pinpointing the internal function that raises the error: xml2:::doc_parse_raw. However, that function is simply a call into internal C code, which makes the issue substantially harder to debug.

The HTML contains a malformed tag that's causing problems, specifically <spoiler:3tbt4d3m>, as the error suggests. If you grab the HTML with httr without parsing it, you can use regex to remove that tag and its contents without incident, as a quick look reveals that it doesn't contain the table.

library(httr)
library(rvest)

url <- 'http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16'

html <- url %>% GET(user_agent('R')) %>% content('text')

html2 <- gsub('<spoiler:3tbt4d3m>.*</spoiler:3tbt4d3m>', '', html)

df <- html2 %>% read_html() %>% 
    html_node(xpath = '//table[@border="1"]') %>% 
    # obviously insufficient to parse double headers, but at least the data exists now
    html_table(fill = TRUE)

df[1:5, 1:3]
##                        Date Progress Overall probability ofspontaneous labor
## 1                      Date Progress                            On this date
## 2 Saturday August 6th, 2016  35W, 0D                                   0.01%
## 3   Sunday August 7th, 2016  35W, 1D                                   0.01%
## 4   Monday August 8th, 2016  35W, 2D                                   0.02%
## 5  Tuesday August 9th, 2016  35W, 3D                                   0.02%
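
The double header noted in the code comment can be flattened after the fact. The following is only a minimal sketch, assuming the first data row holds the second header line, as in the output above. The small `df` built here is a hand-made stand-in for the scraped table (not freshly scraped data), and the ": " separator is an arbitrary choice:

```r
# Hand-built stand-in for the first rows of the scraped table (assumption).
df <- data.frame(
  Date     = c("Date", "Saturday August 6th, 2016"),
  Progress = c("Progress", "35W, 0D"),
  `Overall probability ofspontaneous labor` = c("On this date", "0.01%"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

top <- names(df)                          # first header line
sub <- unlist(df[1, ], use.names = FALSE) # second header line, stored as row 1

# Keep the name as-is where both lines agree, otherwise join them.
names(df) <- ifelse(top == sub, top, paste(top, sub, sep = ": "))

# Drop the row that was really a header and reset the row names.
df <- df[-1, , drop = FALSE]
rownames(df) <- NULL
```

On the real table, the same steps would apply to the data frame returned by html_table(), provided its first row really is the second header line.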

Mixing regex and HTML makes me a bit uneasy. There may be a cleaner way to tidy the document before parsing, but I'm not sure what it would be.
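
If the regex route is kept, the pattern can be made a little more defensive. This is a sketch under assumptions: the suffix after "spoiler:" looks like a generated id, so it may differ between requests, and the sample `html` string is a made-up stand-in for the downloaded page. The pattern uses a lazy match so it stops at the first closing tag, and `(?s)` so that `.` also matches newlines when `perl = TRUE`:

```r
# Made-up stand-in for the page source (assumption, not the real page).
html <- paste0(
  '<html><body>',
  '<spoiler:3tbt4d3m>hidden\ntext</spoiler:3tbt4d3m>',
  '<table border="1"><tr><td>a</td></tr></table>',
  '</body></html>'
)

# Match any <spoiler:ID>...</spoiler:ID> pair without hard-coding the id.
pattern <- "(?s)<spoiler:[[:alnum:]]+>.*?</spoiler:[[:alnum:]]+>"

html2 <- gsub(pattern, "", html, perl = TRUE)
```

The result can then be passed to read_html() exactly as before.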

Another option is to use htmltidy to "clean" the document. You need v0.3.0 or higher, which (as of the date of this answer) means installing the development version from GitHub until the CRAN version reaches 0.3.0 or later:

library(rvest)
library(htmltidy) # devtools::install_github("hrbrmstr/htmltidy")
library(httr)

URL <- "http://spacefem.com/pregnant/due.php?use=EDD&m=09&d=10&y=16"

# the site was not returning content for me w/o a more browser-like user agent

res <- GET(URL, user_agent("Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36"))

cleaned <- tidy_html(content(res, as="text", encoding="UTF-8"),
                     list(TidyDocType="html5"))

pg <- read_html(cleaned)
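
From here the table can presumably be pulled out of `pg` with the same XPath used earlier. A minimal sketch, assuming rvest is installed; the `cleaned` string below is a tiny hand-written stand-in for the tidied page, not the actual output of tidy_html():

```r
library(rvest)

# Hand-written stand-in for the tidied document (assumption).
cleaned <- paste0(
  '<html><body><table border="1">',
  '<tr><th>Date</th><th>Progress</th></tr>',
  '<tr><td>Saturday August 6th, 2016</td><td>35W, 0D</td></tr>',
  '</table></body></html>'
)

pg <- read_html(cleaned)

# Select the bordered table and convert it to a data frame.
tbl <- html_table(html_node(pg, xpath = '//table[@border="1"]'))
```

The real table has a two-line header, so its column names would still need cleaning up afterwards.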
