How to extract information from an XML page in R

Question

I am trying to get all the information from this page: http://ws.parlament.ch/affairs/19110758/?format=xml

First, I download the file and then parse it with xmlParse(file).

download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- xmlParse(destfile)

Now I want to extract all the information I need, for example the title and the ID number. I tried something like this:

title <- xpathSApply(file, "//h2", xmlValue)

But this only gives me an error: unable to find an inherited method for function 'saveXML' for signature '"XMLDocument"'

The next thing I tried was this:

library(plyr)

test <- ldply(xmlToList(file), function(x) { data.frame(x[!names(x) == "id"]) })

This gives me a data.frame with some of the information, but I lose things such as the id (which is the most important part).

I would like to get a data.frame with one row (exactly one row per affair) that contains all the information for that affair, such as id, updated, additionalIndexing, affairType, and so on.

With the following, it works (for the id, for example):

infofile <- xmlRoot(file)

nodes <- getNodeSet(file, "//affair/id")
id <- as.numeric(lapply(nodes, function(x) xmlSApply(x, xmlValue)))

Answer 1

It is an HTML file, not an XML file. You need to use htmlParse:

library(XML)
destfile <- tempfile() # make this example copy-pasteable
download.file(url = "http://ws.parlament.ch/affairs/19110758/?format=xml", destfile = destfile)
file <- htmlParse(destfile)
title <- xpathSApply(file, '//h2')
xmlValue(title[[1]])
# [1] "Heilmittelwesen. Gesetzgebung"

Answer 2

This will get you the XML:

library(XML)
library(RCurl)
library(httr)

srcXML <- getURL("http://ws.parlament.ch/affairs/19110758/?format=xml", 
            .opts=c(user_agent("Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)"),
              verbose()))

myXMLFile <- xmlTreeParse(substr(srcXML,4,nchar(srcXML)))

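I originally used just GET() from httr, but it did not seem to pass the user-agent properly (I need to test it when I am not behind a proxy to pin down the exact error). I also used substr() because a bunch of odd characters at the start made the xmlTreeParse() call fail.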
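
Once the XML is parsed, the one-row data.frame asked for in the question could be built roughly as follows. This is only a sketch: the inline sample document is hypothetical (it borrows the id from the URL and the title from Answer 1, and the real feed may be structured differently), and first_value() is a helper name made up for this example.

```r
library(XML)

# Hypothetical stand-in for the parsed feed; the real element layout may differ.
xml_text <- '<affair>
  <id>19110758</id>
  <title>Heilmittelwesen. Gesetzgebung</title>
</affair>'

doc <- xmlParse(xml_text, asText = TRUE)

# Return the text of the first node matching an XPath, or NA if nothing matches.
first_value <- function(doc, path) {
  vals <- xpathSApply(doc, path, xmlValue)
  if (length(vals) == 0) NA_character_ else vals[[1]]
}

# One row per affair; further columns can be added with the same helper.
affair <- data.frame(
  id = as.numeric(first_value(doc, "//affair/id")),
  title = first_value(doc, "//affair/title"),
  stringsAsFactors = FALSE
)
```

Returning NA for a missing element keeps the row shape stable, so several affairs could later be combined with rbind() even if some fields are absent.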

Answer 1
4 2014-03-28 15:55:29

Answer 2
2 (accepted) 2014-03-28 17:01:25