
Parse RSS Feeds with variable XML structures in R

I am an XML novice trying to scrape and parse the following RSS feed: http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml . Along the way, I ran into two questions:

1) I would like to extract the nodes of the individual news stories using xmlChildren on the parsed document, as follows:

library(RCurl)
library(XML)
xml.url <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
script <- getURL(xml.url)
doc <- xmlParse(script)
doc.children = xpathApply(doc,"//entry",xmlChildren)

Although this procedure works well on other feeds, where the individual news releases are stored in <item> nodes, it does not work in this particular case with <entry> nodes: it returns an empty list. I am stuck here, as I cannot figure out what I am missing in the structure of the XML document.

2) More generally: can I implement this approach so that it handles both cases, where the XML structure stores the individual news stories either in <item> nodes or in <entry> nodes, without knowing the particular structure in advance?

Any help is very much appreciated, thank you.

You'll need to work with namespaces. The feed is Atom, whose elements sit in a default namespace, so the unprefixed //entry matches nothing. Here are XML and xml2 options:

# XML
ns <- xmlNamespaceDefinitions(doc, simplify=TRUE)
names(ns)[1] <- "x"
nodes <- getNodeSet(doc, "//x:entry", namespaces=ns)

# xml2
library(xml2)

XML_URL <- "http://xml.newsbox.ch/corporate_web/che/dufry/digest_en_year_2015_atom.xml"
doc <- read_xml(XML_URL)
ns <- xml_ns_rename(xml_ns(doc), d1="x")
xml_find_all(doc, "//x:entry", ns=ns)
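To see the effect without hitting the network, here is a minimal sketch on a tiny inline Atom document. The document content and the variable names `atom`, `plain` and `prefixed` are made up for illustration; only the xml2 calls already used above are assumed.

```r
library(xml2)

# Hypothetical two-entry Atom document with the usual default namespace.
atom <- read_xml(
  "<feed xmlns='http://www.w3.org/2005/Atom'>
     <entry><title>First</title></entry>
     <entry><title>Second</title></entry>
   </feed>")

# An unprefixed name in XPath 1.0 only matches elements in no namespace,
# so this query does not see the namespaced <entry> elements.
plain <- xml_find_all(atom, "//entry")

# xml2 labels the default namespace d1; rename it to x and use the prefix.
ns <- xml_ns_rename(xml_ns(atom), d1 = "x")
prefixed <- xml_find_all(atom, "//x:entry", ns = ns)

length(plain)
length(prefixed)
```

The prefix itself is arbitrary; what matters is that it is bound to the same namespace URI the feed declares.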

Look at using the XPath boolean() function to handle multiple cases (i.e. the different feed formats).
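One way this could look with xml2 is sketched below. It is only an illustration, not a tested solution for every feed: `story_nodes` is a hypothetical helper, the two inline documents are made up, and matching on local-name() is a choice made here to sidestep the namespace prefix entirely.

```r
library(xml2)

# Hypothetical helper: return the story nodes of an RSS 2.0 or Atom feed.
story_nodes <- function(doc) {
  # boolean() is TRUE when the node-set is non-empty, i.e. the document
  # contains at least one element named "entry" in any namespace.
  is_atom <- xml_find_lgl(doc, "boolean(//*[local-name()='entry'])")
  tag <- if (is_atom) "entry" else "item"
  xml_find_all(doc, sprintf("//*[local-name()='%s']", tag))
}

# Made-up sample feeds, one per format.
rss <- read_xml(
  "<rss version='2.0'><channel>
     <item><title>A</title></item>
     <item><title>B</title></item>
   </channel></rss>")
atom <- read_xml(
  "<feed xmlns='http://www.w3.org/2005/Atom'>
     <entry><title>C</title></entry>
   </feed>")

rss_titles  <- xml_text(xml_find_first(story_nodes(rss),  "./*[local-name()='title']"))
atom_titles <- xml_text(xml_find_first(story_nodes(atom), "./*[local-name()='title']"))
```

If the format check is not needed as a separate step, a single query such as //*[local-name()='item' or local-name()='entry'] should select the same nodes in one pass.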

This may not exactly answer your question, but did you consider using a ready-made package like tm.plugin.webmining?

If you do not want to use the package, you can still inspect its code and see how it parses the data.

