
Reading RSS feeds in PHP or Python / something else?

I am currently developing a website in the Symfony2 framework, and I have written a Command that runs every 5 minutes. It needs to read a large number of RSS news feeds, extract the new items, and store them in our database.

At the moment the command takes about 45 seconds to run, and during those 45 seconds it also uses roughly 50% to 90% of the CPU, even though I have already optimized it a lot.

So my question is: would it be a good idea to rewrite the same command in something else, for example Python? Are the RSS/Atom libraries available for Python faster and better optimized than the ones available for PHP?

Thanks in advance, Jaap

You can parse raw XML using lxml, which uses the underlying libxml C library:

http://lxml.de/parsing.html

Because parsing is done in native code, it's fast.
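A minimal sketch of what that looks like (the feed content here is invented for illustration; lxml mirrors the stdlib ElementTree API, so the stdlib parser is used as a fallback when lxml is not installed):

```python
# Parse an RSS 2.0 feed with lxml.etree, which wraps the native libxml2
# C library. The stdlib xml.etree module exposes the same API, so it
# serves as a (slower) fallback.
try:
    from lxml import etree                     # fast, C-based parser
except ImportError:
    import xml.etree.ElementTree as etree      # stdlib fallback, same API

# Illustrative feed; in the real command this would be the downloaded bytes.
rss = b"""<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = etree.fromstring(rss)                   # returns the <rss> element
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in root.iter("item")              # every <item>, in document order
]
print(items)
```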

Someone is already doing this:

Encoding error while parsing RSS with lxml

On the other hand, if the bottleneck is not XML parsing but downloading the data and sorting it out, then a faster parser won't help.

You could check the Cache headers of the feeds before parsing them.
That way you can skip the expensive parsing step for the (probably many) feeds that haven't changed.

Store a last_updated date in your database for each source and then check it against any cache headers the server sends. There are several, so see which fits best, which is served most often, or check against all of them.
The headers could be:

  • Expires
  • Last-Modified
  • Cache-Control
  • Pragma
  • ETag

But beware: you have to trust your feed sources.
Not every feed provides these headers, or provides them correctly.
But I am sure a lot of them do.
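The header check above can be sketched as a small pure function (hypothetical names; `stored_etag` and `stored_last_modified` are assumed to come from your database's last_updated bookkeeping):

```python
# Decide whether a feed needs re-downloading/re-parsing by comparing the
# state stored in the DB against the response's HTTP caching headers.
from email.utils import parsedate_to_datetime


def feed_changed(headers, stored_etag=None, stored_last_modified=None):
    """Return False when the cache headers prove the feed is unchanged."""
    etag = headers.get("ETag")
    if etag is not None and etag == stored_etag:
        return False                     # identical entity tag -> no change

    last_mod = headers.get("Last-Modified")
    if last_mod is not None and stored_last_modified is not None:
        if parsedate_to_datetime(last_mod) <= stored_last_modified:
            return False                 # not modified since our last fetch

    return True                          # no usable header: assume it changed
```

Even better, you can send the stored values back as `If-None-Match` / `If-Modified-Since` request headers and let the server answer with a cheap `304 Not Modified`, so you skip the download as well as the parse.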

I solved this by adding a usleep() call at the end of each iteration over a feed. This drastically lowered CPU and memory consumption. The process used to take about 20 minutes, and now only takes around 5!
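The Python equivalent of that PHP usleep() fix would use time.sleep(); a minimal sketch (the feed list, handler, and 0.25 s pause are all illustrative, not the asker's actual values):

```python
# Yield the CPU briefly after each feed so the importer never
# monopolizes a core for the whole run.
import time


def process_feeds(feeds, handle, pause=0.25):
    for feed in feeds:
        handle(feed)        # fetch + parse + store one feed
        time.sleep(pause)   # give the CPU back between feeds
```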
