Parsing RSS feeds with a variable XML structure in R

Question

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml . Along the way, I ran into two questions:

1) I would like to extract the nodes of individual news stories using xmlChildren on the parsed document as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I implement this approach so that it handles both cases, where the XML structure stores the individual news stories either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

Answer 1

You'll need to work with namespaces: the feed declares a default namespace, so an unprefixed //entry matches nothing. Here are XML and xml2 options:

# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
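
To see why the unprefixed query comes back empty, here is a minimal sketch using xml2 on an inline document. The tiny feed below is a made-up stand-in for the real one (an assumption for illustration); it only mimics the default Atom namespace declaration.

```r
library(xml2)

# Hypothetical miniature Atom feed standing in for the real one
atom <- read_xml('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First</title></entry>
  <entry><title>Second</title></entry>
</feed>')

# In XPath 1.0 an unprefixed name only matches elements in no namespace,
# so this finds nothing even though <entry> elements are present
plain <- xml_find_all(atom, "//entry")

# Binding the default namespace (called d1 by xml2) to a prefix of our
# choosing lets the same query match
ns <- xml_ns_rename(xml_ns(atom), d1 = "x")
prefixed <- xml_find_all(atom, "//x:entry", ns = ns)

length(plain)
length(prefixed)
```

Feeds that store stories in <item> usually declare no default namespace, which would explain why the original code works there.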

Look at using the boolean() XPath function to handle multiple cases (i.e. the different feed formats).
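
One way to cover both feed formats without knowing the structure in advance is sketched below. Note that it swaps in a different technique from the boolean() suggestion: it matches on local-name(), which ignores namespaces altogether. The helper get_stories and both inline feeds are hypothetical, written only for this illustration.

```r
library(xml2)

# Hypothetical helper (not part of xml2): return the story nodes whether
# the feed uses <item> (RSS) or namespaced <entry> (Atom)
get_stories <- function(doc) {
  xml_find_all(doc, "//*[local-name() = 'item' or local-name() = 'entry']")
}

# Made-up sample feeds, one per format
rss <- read_xml('<rss version="2.0"><channel>
  <item><title>A</title></item>
</channel></rss>')
atom <- read_xml('<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>B</title></entry>
  <entry><title>C</title></entry>
</feed>')

length(get_stories(rss))
length(get_stories(atom))

# boolean() can still be used to test which format a document has
has_items <- xml_find_lgl(rss, "boolean(//item)")
```

Because local-name() discards the namespace, this would also match an <item> or <entry> element from an unrelated vocabulary embedded in the feed; if that matters, binding the namespace to a prefix as in the answer is the stricter choice.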

Answer 2

This may not exactly answer your question, but did you consider using a ready-made package like tm.plugin.webmining?

If you do not want to use the package, you can still inspect its code and see how it parses the data.


Answer 1
2, accepted, 2015-11-30 12:42:02

Answer 2
1, 2015-11-30 12:45:54
