How to consume streaming XML (RSS feeds) with R?

I have a rough understanding of how to use the XML package to read and parse an XML file, such as a piece of an RSS feed. However, what is the basic setup for continuously reading an RSS feed?

For example, imagine that I want to set up a facility that continuously reads the feed from http://evemaps.dotlan.net/feed/sovereignty and stores the data in some kind of R data structure (say, a data.frame). I imagine that I would need to do something like the following:

  1. Set up R on a server (e.g. RStudio Server on an AWS instance)
  2. Open an HTTP connection to the RSS feed
  3. Continuously read and parse distinct bits of the feed and add them to a data.frame, which grows with each entry added
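
Steps 2 and 3 can be sketched roughly as follows. This is only a minimal illustration, assuming the feed is plain RSS 2.0 whose `<item>` elements carry `guid`, `title`, and `pubDate` children; the helper names `parse_feed` and `append_new` and the sample feed text are hypothetical and are not taken from the feed named in the question.

```r
library(XML)  # the package mentioned in the question

# Parse RSS text into a data.frame with one row per <item>.
# Assumes RSS 2.0 items with guid, title and pubDate children;
# a missing child becomes NA.
parse_feed <- function(rss_text) {
  doc <- xmlParse(rss_text, asText = TRUE)
  items <- getNodeSet(doc, "//item")
  field <- function(node, name) {
    val <- xpathSApply(node, paste0("./", name), xmlValue)
    if (length(val) == 0) NA_character_ else val[[1]]
  }
  data.frame(
    guid    = vapply(items, field, character(1), name = "guid"),
    title   = vapply(items, field, character(1), name = "title"),
    pubDate = vapply(items, field, character(1), name = "pubDate"),
    stringsAsFactors = FALSE
  )
}

# Append only the rows whose guid is not already stored, so that
# reading the same feed twice does not duplicate entries.
append_new <- function(store, batch) {
  fresh <- batch[!(batch$guid %in% store$guid), , drop = FALSE]
  rbind(store, fresh)
}

# Hypothetical sample feed text, used here instead of a live HTTP request.
sample_rss <- '<rss version="2.0"><channel><title>Example</title>
  <item><guid>a1</guid><title>First</title><pubDate>Mon, 01 Jan 2024 00:00:00 GMT</pubDate></item>
  <item><guid>a2</guid><title>Second</title><pubDate>Mon, 01 Jan 2024 01:00:00 GMT</pubDate></item>
</channel></rss>'

store <- parse_feed(sample_rss)
# Reading the same items a second time adds nothing.
store <- append_new(store, parse_feed(sample_rss))
```

In a real setup the text would come from the feed URL rather than a string, for instance via base R's `readLines()` on a `url()` connection, with the lines pasted together before parsing.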

However, this is still a rather vague picture. What are the basic packages and functions that I would need to string together to make this work? In other words, what are the basic steps that I would need to put in place to create such a facility? I'm not looking for anyone to write this facility for me (even though that would be nice!). Rather, I'm trying to understand which overall steps are involved.

I think you're looking for .

With an RSS client (i.e., your R application on AWS) you have two choices: polling or PubSubHubbub (aka webhooks, PuSH, and others). As mentioned here, with polling you may be throttled if you exceed a publisher's maximum-pings policy. PuSH works as a subscription: the publisher's server notifies your R application in real time whenever there is a new update.
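
To make the polling option concrete, here is a minimal sketch of a polling loop that waits a fixed interval between requests, which is one simple way to stay under a publisher's request limit. The function `poll_feed`, its parameters, and the 300-second default are assumptions for illustration only; the right interval depends on the publisher's policy. The fetch step is passed in as a function so the loop can be tried without a network connection.

```r
# Call fetch() repeatedly, pass each successful result to handle(),
# and sleep interval_sec seconds between calls.
# A failed fetch is skipped rather than stopping the loop.
poll_feed <- function(fetch, handle, interval_sec = 300, max_polls = Inf) {
  polls <- 0
  while (polls < max_polls) {
    result <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(result)) handle(result)
    polls <- polls + 1
    # Do not sleep after the final poll.
    if (polls < max_polls) Sys.sleep(interval_sec)
  }
  invisible(polls)
}

# Usage with a stand-in fetch function (hypothetical, no HTTP involved).
seen <- character(0)
fake_fetch <- function() "payload"
n_polls <- poll_feed(
  fetch = fake_fetch,
  handle = function(x) seen <<- c(seen, x),
  interval_sec = 0,
  max_polls = 3
)
```

In practice `fetch` would download the feed text and `handle` would parse it and append any new entries to the stored data.frame. With PuSH the loop is replaced by an HTTP endpoint that the hub calls, which is not shown here.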

The SO answer linked above leads to the blog of Superfeedr, a popular pay-as-you-go hub provider, and to a post that describes the PuSH protocol's workflow and shows a command-line implementation.

You can hear more about the protocol in this Google IO 2010 presentation by one of the engineers who crafted PuSH.

