
Sync item list between Perl scripts, across servers

Original post: 2016-11-14 05:03:52 | multithreading / perl / synchronization

I have a multi-threaded Perl script which does the following:

1) One boss thread searches through a folder structure on an external server. For each file it finds, it adds its path/name to a thread queue. If the path/file is already in the queue, or is being processed by the worker threads, the enqueuing is skipped.

2) A dozen worker threads dequeue from the above queue, process the files, and remove them from the hard disk.

It runs on a single physical server, and everything works fine.

Now I want to add a second server, which will work concurrently with the first one, searching through the same folder structure, looking for files to enqueue/process. I need a means to make both servers aware of what each one is doing, so that they don't process the same files. The queue is small, ranging from 20 to 100 items. The list is very dynamic and changes many times per second.

Do I simply write to/read from a regular file to keep them in sync about the current item list? Any ideas?

1 Answer

I would be very wary of using a regular file: it'll be difficult to manage locking and caching semantics.

IPC is a big and difficult topic, and when you're doing it server to server it can get very messy indeed. You'll need to think about much more complicated scenarios, like 'what if host A crashes with partial processing'.

So first off, I would suggest you need to (if at all possible) make your process idempotent. Specifically, set it up so that IF both servers do end up processing the same things, then no harm is done; it's 'just' inefficient.

I can't tell you how to do this for your particular processing, but the general approach is to permit (and discard) duplication of effort.
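
To make "permit and discard" a bit more concrete, here is a minimal sketch of a worker step that treats "the file has already gone" as a normal outcome rather than an error. The name `process_once` and the `$handler` callback are made up for illustration, and it assumes the real processing inside the callback is itself safe to repeat, which only you can judge.

```perl
use strict;
use warnings;
use Errno qw(ENOENT);

# Hypothetical worker step: process one file so that a duplicate attempt
# (for example by the other server) is harmless rather than fatal.
# $handler is an assumed callback that does the real work on the contents.
sub process_once {
    my ($path, $handler) = @_;

    my $fh;
    unless (open $fh, '<', $path) {
        # The file is already gone: someone else finished it. Not an error.
        return 'skipped' if $! == ENOENT;
        die "cannot open $path: $!";
    }
    my $content = do { local $/; <$fh> };   # slurp the whole file
    close $fh;

    $handler->($content);                   # must itself be safe to repeat

    # Removing an already-removed file also counts as success.
    unless (unlink $path) {
        die "cannot remove $path: $!" unless $! == ENOENT;
    }
    return 'done';
}
```

If both servers pick up the same file, the worst case with something like this is that the handler runs twice; one side then finds the file missing and moves on.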

In terms of synchronising your two processes on different servers, I don't think a file will do the trick. Shared filesystem IPC is not really suitable for a near-real-time sort of operation, because of caching. Default cache lag on NFS is somewhere in the order of 60s.

I would suggest that you think in terms of sockets; they're a fairly standard way of doing server-to-server IPC. As you already check 'pending' items in the queue, you could expand this check to query the other host before enqueuing (note: consider what you'll do if it's offline or otherwise unreachable).
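
As a rough sketch of what that query could look like, using the core `IO::Socket::INET` module: the one-line `HAS <path>` / `YES` / `NO` protocol and the names `peer_has_item` and `answer_query` are invented for this example, and paths are assumed not to contain newlines. The client returns `undef` when the peer can't be reached, so the caller has to decide explicitly what "unknown" means.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Ask the other host whether it already has $path queued or in progress.
# Returns 1 (yes), 0 (no), or undef when the peer cannot be reached.
# Note: Timeout only bounds the connect, not the read that follows.
sub peer_has_item {
    my ($host, $port, $path, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout // 2,
    ) or return undef;
    print {$sock} "HAS $path\n";
    my $reply = <$sock>;
    close $sock;
    return undef unless defined $reply;
    return $reply =~ /^YES/ ? 1 : 0;
}

# Answer one query on an accepted connection. $pending is a hash ref
# holding this host's view of queued and in-progress paths (assumed).
sub answer_query {
    my ($conn, $pending) = @_;
    my $line = <$conn>;
    return unless defined $line;
    $line =~ s/\r?\n\z//;
    my ($cmd, $path) = split / /, $line, 2;
    my $known = defined $path && $cmd eq 'HAS' && $pending->{$path};
    print {$conn} ($known ? "YES\n" : "NO\n");
    return;
}
```

The boss thread could call `peer_has_item` just before enqueuing and skip the file on a 1. Whether an `undef` should mean "enqueue anyway" or "wait" depends on how safe duplicate processing is for you. In a threaded script the pending hash would also need to be shared between the listener and the boss thread, which this sketch leaves out.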

The caveat here is that parallelism works better the less IPC is going on. Talking across a network is generally a bit faster than talking to a disk, but it's considerably slower than the speed at which a processor runs. So if you can work out some sort of caching/locking mechanism, where you don't need to update for each and every file, then it'll run much better.
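
One way to get the per-file traffic down to zero, sketched here as an assumption rather than something your setup necessarily allows, is static partitioning: each host decides from the path alone whether a file is "its" file. This is a different technique from locking or caching, and the name `owns_path`, the host index and the host count are all made up for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Decide ownership from the path alone, so no message is needed per file.
# $my_index (0-based) and $n_hosts are assumed to come from configuration.
sub owns_path {
    my ($path, $my_index, $n_hosts) = @_;
    # Use the first 32 bits of the MD5 digest as a stable bucket number.
    my $bucket = hex(substr(md5_hex($path), 0, 8)) % $n_hosts;
    return $bucket == $my_index;
}
```

The boss thread on the first of two servers would then do something like `next unless owns_path($file, 0, 2);` before its existing "already pending" check. The obvious downside is that if one host goes down, its share of the files just sits there until it comes back, so this pairs best with idempotent processing and some fallback.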