
Process multiple XML files at the same time in PHP

Hello, I'm making a component in PHP that reads an Atom file and gets a list of XML files to process. I need to parse them and insert the data into the database.

For each type of XML (news, scores, schedules) I do something like this:

  1. Get the list of XML files to process
  2. Insert each XML URL into the database with process state = 0
  3. Loop through the list
  4. Open the XML URL and save it to disk
  5. Process it
  6. Set file state = 1
  7. Go to the next one

The thing is, I have plenty of RAM and cores on my machine, but the list keeps growing and the number of files pending processing keeps getting bigger.

I want to know how I can process, let's say, 10 files at the same time, since I have the RAM and cores for it. If I process one at a time, the pending list will just keep getting bigger.

I'd appreciate some ideas, and I apologize for my English.

You could try something like divide and conquer in your step 4. Here is a simple implementation of parallel batch processing.
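The implementation linked above is not reproduced here. As a rough sketch of the general idea only, one way to split the pending list into batches and run each batch in its own child process is shown below. It assumes the CLI with the pcntl extension loaded; `processInParallel` and the `$processOne` callback are hypothetical names, not part of any library.

```php
<?php
// Sketch only: split the pending URLs into batches and fork one child
// process per batch. Assumes PHP CLI with the pcntl extension.
// processInParallel() and $processOne are hypothetical names.

function processInParallel(array $urls, int $workers, callable $processOne): int
{
    if ($urls === []) {
        return 0;
    }

    // Divide the list into at most $workers batches of roughly equal size.
    $batchSize = (int) ceil(count($urls) / max(1, $workers));
    $batches = array_chunk($urls, $batchSize);

    $pids = [];
    foreach ($batches as $batch) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            throw new RuntimeException('Could not fork a worker process');
        }
        if ($pid === 0) {
            // Child: work through its own batch, then exit.
            // Open database connections here, not before the fork,
            // so that children do not share one connection.
            $code = 0;
            try {
                foreach ($batch as $url) {
                    $processOne($url);
                }
            } catch (Throwable $e) {
                $code = 1;
            }
            exit($code);
        }
        $pids[] = $pid; // Parent: remember the child and keep forking.
    }

    // Parent: wait for every child and count the ones that failed.
    $failed = 0;
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
        if (!pcntl_wifexited($status) || pcntl_wexitstatus($status) !== 0) {
            $failed++;
        }
    }
    return $failed;
}
```

With 10 workers, as in the question, a call might look like `processInParallel($pendingUrls, 10, $downloadAndParse)`, where `$pendingUrls` and `$downloadAndParse` stand for your own list and your own step 4 to 6 logic. Because every child gets a disjoint batch, no URL is handled twice.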

You may also try parallel curling. This PHP class provides an easy interface for running multiple concurrent cURL requests.
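The class linked above is not shown here. If you would rather not add a dependency, PHP's built-in `curl_multi_*` functions can do the same job. The following is a minimal sketch, assuming PHP 8.0 or later with the curl extension; `fetchAll` is a hypothetical helper name and the timeout value is an arbitrary choice.

```php
<?php
// Sketch only: download several URLs concurrently with curl_multi.
// fetchAll() is a hypothetical helper; assumes PHP 8.0+ with ext-curl.

function fetchAll(array $urls, int $timeout = 30): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000); // avoid a busy loop if select() cannot wait
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer errors are reported through curl_multi_info_read().
    $failedKeys = [];
    while (($info = curl_multi_info_read($multi)) !== false) {
        if ($info['msg'] === CURLMSG_DONE && $info['result'] !== CURLE_OK) {
            $key = array_search($info['handle'], $handles, true);
            if ($key !== false) {
                $failedKeys[$key] = true;
            }
        }
    }

    // Collect the bodies; failed transfers are returned as null.
    $bodies = [];
    foreach ($handles as $key => $ch) {
        $bodies[$key] = isset($failedKeys[$key]) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }
    return $bodies;
}
```

The result keeps the keys of the input array, so `fetchAll([12 => $url])` returns the body under key `12`. That makes it easy to key the downloads by the database row id. Note that this only parallelizes the download in step 4; the parsing still runs in one process.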

You're using the database as a queue. This is normally discouraged (there is software that does this better), and your example runs into a typical problem with that approach:

The process state field is initialized with the value 0. You then process each entry with the value 0. Let's say processing an entry takes 10 minutes and you insert one URL per minute, so you need to process 10 URLs in parallel to cope with the insertion rate. Let's play this through:

  • In the first minute you insert the first URL and start to process it. Since each of the 10 processors takes the first URL with status 0, all 10 processors process the first URL.

  • In the second minute you insert the second URL, and you are still processing the first URL ten times over.

  • In the third minute you insert the third URL, and you are still processing the first URL ten times over.

And so on. You get the picture. The status is not managed properly. Since you are designing the queue system on your own, you need to take care that it works under parallel requirements. You should create a component for that and test it thoroughly with fake data and logging so that you can track and verify its operation. Then use such a system for the real thing. It might not do everything you want, but it should work much more robustly.
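To illustrate what managing the status properly could look like, here is a minimal sketch in which a worker has to claim a row before processing it. It is built on assumptions: the table name `xml_queue`, its columns `id`, `url` and `state`, and the extra "in progress" value 2 are all hypothetical (the question only has states 0 and 1), and PDO is assumed to throw exceptions on errors.

```php
<?php
// Sketch only: let each worker claim a pending row before processing it.
// Table xml_queue(id, url, state) and STATE_IN_PROGRESS are assumptions;
// the question itself only uses state 0 (pending) and 1 (done).

const STATE_PENDING = 0;
const STATE_DONE = 1;
const STATE_IN_PROGRESS = 2;

function claimNextUrl(PDO $db): ?array
{
    while (true) {
        // Look for the oldest pending entry.
        $select = $db->prepare(
            'SELECT id, url FROM xml_queue WHERE state = :pending ORDER BY id LIMIT 1'
        );
        $select->execute([':pending' => STATE_PENDING]);
        $row = $select->fetch(PDO::FETCH_ASSOC);
        if ($row === false) {
            return null; // nothing left to do
        }

        // Try to claim it. The "AND state = :pending" condition makes the
        // claim atomic: only one worker's UPDATE can change the row.
        $claim = $db->prepare(
            'UPDATE xml_queue SET state = :busy WHERE id = :id AND state = :pending'
        );
        $claim->execute([
            ':busy' => STATE_IN_PROGRESS,
            ':id' => $row['id'],
            ':pending' => STATE_PENDING,
        ]);
        if ($claim->rowCount() === 1) {
            return $row; // this worker owns the entry now
        }
        // Another worker was faster; loop and try the next pending entry.
    }
}

function markDone(PDO $db, int $id): void
{
    $done = $db->prepare('UPDATE xml_queue SET state = :done WHERE id = :id');
    $done->execute([':done' => STATE_DONE, ':id' => $id]);
}
```

Each of the 10 workers would loop on `claimNextUrl()`, process the returned URL and call `markDone()`. This sketch does not handle workers that crash while a row is in state 2; a real component would need something like a timestamp to detect and release stale claims.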

Alternatively, get a queue component that has already been created, tested, and proven in real use.
