How to extract information from an XML page in R

Question

I'm trying to get all the information from this page: http://ws.parlament.ch/affairs/19110758/?format=xml

First I download the page to a local file, then parse it with xmlParse(file).

download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- xmlParse(destfile)

Now I want to extract the information I need, for example the title and the ID number. I tried something like this:

title <- xpathSApply(file, "//h2", xmlValue)

But this only gives me an error: unable to find an inherited method for function 'saveXML' for signature '"XMLDocument"'

The next thing I tried is this:

library(plyr)

test <-ldply(xmlToList(file), function(x) { data.frame(x[!names(x)=="id"]) } )

This gives me a data.frame with some of the information, but I lose fields such as id (which is the most important one).

I'd like to get a data.frame with exactly one row per affair, containing all the information for that affair, such as id, updated, additionalIndexing, affairType, etc.

This approach works (example for id):

infofile <- xmlRoot(file)

nodes <-  getNodeSet(file, "//affair/id")
id <-as.numeric(lapply(nodes, function(x) xmlSApply(x, xmlValue)))

Answer 1

It is an HTML file, not an XML file. You need to use htmlParse:

destfile <- tempfile() # make this example copy-pasteable
download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- htmlParse(destfile)
title <- xpathSApply(file, '//h2')
xmlValue(title[[1]])
# [1] "Heilmittelwesen. Gesetzgebung"
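As a small offline sketch of the same idea, xmlValue can also be passed straight to xpathSApply once the document has been parsed with htmlParse. The inline HTML string below is a made-up stand-in for the downloaded page (only the h2 text is taken from the output above), so the real page may be structured differently:

library(XML)

# Hypothetical stand-in for the downloaded page; only the <h2> text is from the answer
html_text <- "<html><body><h2>Heilmittelwesen. Gesetzgebung</h2></body></html>"

# asText = TRUE tells the parser that the argument is content, not a file name
doc <- htmlParse(html_text, asText = TRUE)

# Apply xmlValue to every matching node; the result is a character vector
titles <- xpathSApply(doc, "//h2", xmlValue)

If the page had several h2 elements, titles would hold one entry per element.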

Answer 2

This will get you to your XML:

library(XML)
library(RCurl)
library(httr)

srcXML <- getURL("http://ws.parlament.ch/affairs/19110758/?format=xml", 
            .opts=c(user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"),
              verbose()))

myXMLFile <- xmlTreeParse(substr(srcXML,4,nchar(srcXML)))

I would have used just GET() from httr, but it doesn't seem to pass the user agent along properly (I need to test it when I'm not behind a proxy to be sure what the specific error is). I also added the substr() call because there are a few odd characters at the front that cause the xmlTreeParse() call to fail.
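A possible variation on the substr() step, sketched offline: instead of counting characters, drop everything before the first "<", then collect the child elements of the affair node into a one-row data.frame as the question asks. The inline string is a hypothetical stand-in for srcXML. The affair and id names follow the question's XPath, the title element and the three leading characters are assumptions, and the real response may look different:

library(XML)

# Hypothetical stand-in for srcXML: three stray characters, then the XML itself
src_xml <- paste0("\u00ef\u00bb\u00bf",
                  "<affair><id>19110758</id>",
                  "<title>Heilmittelwesen. Gesetzgebung</title></affair>")

# Remove whatever precedes the first "<" rather than relying on a fixed offset
clean_xml <- sub("^[^<]*", "", src_xml)

doc <- xmlParse(clean_xml, asText = TRUE)

# One entry per direct child element of <affair>, named after the element
fields <- getNodeSet(doc, "//affair/*")
values <- sapply(fields, xmlValue)
names(values) <- sapply(fields, xmlName)

# A named list becomes a data.frame with one row and one column per field
affair_row <- as.data.frame(as.list(values), stringsAsFactors = FALSE)

Note that xmlValue concatenates the text of nested elements, so fields with children of their own would probably need separate handling before they fit into a single cell.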


Answer 1: score 4, 2014-03-28 15:55:29

Answer 2: score 2, accepted, 2014-03-28 17:01:25
