
Speed up reading multiple XML files in PHP

I currently have a PHP file that must read hundreds of XML files. I have no choice in how these XML files are constructed; they are created by a third party.

The first XML file is a large list of titles for the rest of the XML files, so I search the first XML file to get the file names for the rest of them.

I then read each XML file, searching its values for a specific phrase.

This process is really slow. I'm talking 5 1/2 minute runtimes... which is not acceptable for a website; customers won't stay on for that long.

Does anyone know a way to speed my code up, to a maximum runtime of approximately 30 seconds?

Here is a pastebin of my code: http://pastebin.com/HXSSj0Jt

Thanks, and sorry for the incomprehensible English...

First of all, if you have to deal with large XML files on each request to your service, it is wise to download the XML files once, preprocess them, and cache them locally.

If you cannot preprocess and cache the XML files and have to download them for each request (which I don't really believe is the case), you can try to optimize by using XMLReader or some SAX event-based XML parser. The problem with SimpleXML is that it uses DOM underneath. DOM (as the letters stand for) creates a document object model in your PHP process's memory, which takes a lot of time and eats tons of memory. I would go so far as to say that DOM is useless for parsing large XML files.

XMLReader, on the other hand, will let you traverse a large XML document node by node while barely using any memory, with the tradeoff that you cannot issue XPath queries or use any other non-sequential node access patterns.

For how to use XMLReader, consult the PHP manual page for the XMLReader extension.
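A minimal sketch of that streaming approach. The `<title>` element name and the search phrase are placeholders for whatever the real files contain, and it writes a small sample file so it runs standalone:

```php
<?php
// Stream XML with XMLReader instead of loading it all into memory via
// SimpleXML. Only the current node is materialized at any time.
$sample = '<catalog><title>Contains the phrase</title>'
        . '<title>Something else</title></catalog>';
file_put_contents('sample.xml', $sample);

$reader = new XMLReader();
$reader->open('sample.xml');

$matches = [];
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
            && $reader->localName === 'title') {
        // readString() returns the node's text without building a DOM.
        $text = $reader->readString();
        if (stripos($text, 'the phrase') !== false) {
            $matches[] = $text;
        }
    }
}
$reader->close();
```

The same loop works unchanged on a multi-gigabyte file, since memory use stays flat regardless of document size.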

Your main problem is that you're trying to make hundreds of HTTP downloads to perform the search. Unless you get rid of that restriction, it's only gonna go so fast.

If for some reason the files aren't cacheable at all (unlikely), not even some of the time, you can pick up some speed by downloading in parallel. See the curl_multi_*() functions. Alternatively, use wget from the command line with xargs to download in parallel.
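A sketch of the curl_multi approach, assuming the list of URLs has already been pulled from the index file. The function name is my own; the curl_multi_* calls are the standard extension API:

```php
<?php
// Download several files concurrently instead of one blocking
// request at a time. Returns an array of bodies keyed by URL.
function fetch_all(array $urls): array {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity, don't busy-loop
        }
    } while ($active && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

With hundreds of small files, wall-clock time then approaches the slowest single download rather than the sum of all of them.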

The above sounds crazy if you have any kind of traffic, though.

Most likely, the files can be cached for at least a short time. Look at the HTTP headers and see what kind of freshness info their server sends. It might say how long until the file expires, in which case you can save it locally until then. Or, it might give a Last-Modified date or an ETag, in which case you can do conditional GET requests, which should speed things up further.
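A sketch of the conditional-GET idea using Last-Modified; the function name and cache path are illustrative, and an ETag would work the same way via an If-None-Match header:

```php
<?php
// Reuse a cached copy and revalidate it with a conditional GET.
// A 304 Not Modified response means the cached bytes are still current.
function fetch_with_cache(string $url, string $cacheFile): string {
    $headers = [];
    if (is_file($cacheFile)) {
        // Ask the server to send the body only if it changed since our copy.
        $headers[] = 'If-Modified-Since: '
                   . gmdate('D, d M Y H:i:s', filemtime($cacheFile)) . ' GMT';
    }

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    $body = curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    if ($code === 304) {            // Not Modified: serve the local copy
        return file_get_contents($cacheFile);
    }
    file_put_contents($cacheFile, $body);
    return $body;
}
```

A 304 response carries no body, so revalidating hundreds of unchanged files costs only small header round-trips instead of full downloads.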

I would probably set up a local Squid cache and have PHP make these requests through Squid. It'll take care of all the "use the local copy if it's fresh, or conditionally retrieve a new version" logic for you.

If you still want more performance, you can transform the cached files into a more suitable format (e.g., stick the relevant data in a database). Or, if you must stick with the XML format, you can do a string search on each file first, to test whether you should bother parsing that file as XML at all.
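That pre-filter can be as simple as this; the helper name and sample files are made up for the demonstration:

```php
<?php
// Cheap substring check before paying for XML parsing: only files
// that mention the phrase at all get handed to the parser.
function file_mentions(string $path, string $phrase): bool {
    // stripos on the raw bytes is far cheaper than building a parse tree.
    return stripos(file_get_contents($path), $phrase) !== false;
}

file_put_contents('a.xml', '<doc><title>has the phrase</title></doc>');
file_put_contents('b.xml', '<doc><title>irrelevant</title></doc>');

var_dump(file_mentions('a.xml', 'the phrase')); // bool(true)
var_dump(file_mentions('b.xml', 'the phrase')); // bool(false)
```

If only a handful of the hundreds of files contain the phrase, this skips the expensive parsing step for everything else.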
