
File processing on load balancing server (Cluster)

I have a PHP application running on a cluster server. It copies a file from an AWS bucket to the server, processes it (unzips the file, converts the PDF to XML using iText in Java, reads the XML and saves the data to a database), and then uploads the processed file back to the bucket.

It works fine on a single instance, but with multiple load-balanced instances the file being processed on the server disappears. I cannot process the file directly in the bucket, because I cannot unzip it there or run a jar file there, so I have to store the file temporarily for processing. Is there any way to handle this situation?

A few possible solutions:

  • Use a single central key-value store (database) to record the paths of the files that are currently being processed. Before downloading a new file, check that it isn't already being processed. You could use Redis for this.
  • Upload a new, empty file to S3 with a marker in its name, so that if the marker file is present you know the accompanying file is already being processed (though I'm not sure whether S3 caches directory listings). With this solution you should also consider the cost of writing a file to S3, which depends on your scale.
  • Rename or remove the file from S3 while it's being processed
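The Redis option could look roughly like the sketch below. It assumes `redis_client` is a redis-py style client (for example `redis.Redis(...)`); the `processing:` key prefix, the function names, and the 15-minute expiry are illustrative choices, not part of the original answer.

```python
def try_claim(redis_client, file_key, worker_id, ttl_seconds=900):
    """Try to claim file_key for this worker; return True if we now own it."""
    # SET with nx=True only writes when the key is absent, so only one
    # instance can win. ex=ttl_seconds lets the claim expire if the worker dies.
    result = redis_client.set(
        f"processing:{file_key}", worker_id, nx=True, ex=ttl_seconds
    )
    return bool(result)


def release(redis_client, file_key, worker_id):
    """Release the claim, but only if this worker still holds it."""
    lock_name = f"processing:{file_key}"
    holder = redis_client.get(lock_name)
    if isinstance(holder, bytes):  # redis-py returns bytes by default
        holder = holder.decode()
    if holder == worker_id:
        # Note: get-then-delete is not atomic; it is good enough for a sketch.
        redis_client.delete(lock_name)
        return True
    return False
```

An instance would call `try_claim` before downloading a file and skip the file when it returns `False`.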

There can be multiple solutions to this:

One solution is to use object tags: when you upload a processed file, apply a tag such as processed=true, and check the tags before downloading a file.

A better solution is to use Lambda for this task.
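A minimal sketch of the tagging idea, assuming `s3_client` is a boto3 S3 client (`boto3.client("s3")`). The function names are hypothetical; the tag name `processed` and value `true` come from the suggestion above.

```python
def is_processed(s3_client, bucket, key):
    """Return True if the object carries the tag processed=true."""
    response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    tags = {tag["Key"]: tag["Value"] for tag in response.get("TagSet", [])}
    return tags.get("processed") == "true"


def mark_processed(s3_client, bucket, key):
    """Set processed=true while keeping the object's other tags."""
    response = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    # put_object_tagging replaces the whole tag set, so carry the others over.
    tag_set = [t for t in response.get("TagSet", []) if t["Key"] != "processed"]
    tag_set.append({"Key": "processed", "Value": "true"})
    s3_client.put_object_tagging(
        Bucket=bucket, Key=key, Tagging={"TagSet": tag_set}
    )
```

Reading a tag and then writing it is not atomic, so on its own this probably does not stop two instances from picking up the same file at the same moment.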

You can use the following pattern:

  1. An S3 upload triggers Lambda
  2. Lambda drops a message in SQS
  3. The application monitors SQS
  4. The application processes the file
  5. The application deletes the message.
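Steps 3 to 5 on the application side could be sketched as follows. This assumes `sqs_client` is a boto3 SQS client and that each message body is JSON of the form `{"bucket": ..., "key": ...}`; that body format, the function name `poll_once`, and the `process_file` callback are assumptions made for illustration.

```python
import json


def poll_once(sqs_client, queue_url, process_file, visibility_timeout=900):
    """Receive at most one message, process it, and delete it on success."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling, avoids busy-looping on an empty queue
        VisibilityTimeout=visibility_timeout,  # hides the message from other instances
    )
    handled = 0
    for message in response.get("Messages", []):
        body = json.loads(message["Body"])
        # If process_file raises, the message is not deleted and becomes
        # visible again once the visibility timeout expires.
        process_file(body["bucket"], body["key"])
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        handled += 1
    return handled
```

Each instance would call `poll_once` in a loop. While one instance holds a message, SQS hides it from the others, so normally only one instance works on a given file at a time.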


Or just have Lambda do all the work on S3 upload, depending on how long the process runs. The maximum execution time was 5 minutes when this answer was written (AWS has since raised the limit to 15 minutes). http://docs.aws.amazon.com/lambda/latest/dg/limits.html

For example:

Set up a Lambda function to listen for the S3 "new object uploaded" event. Then have the Lambda function drop a message in SQS (from the event data it receives, the Lambda function knows the source bucket name and object key name). The server can monitor the queue, process the message, extract the file and upload it to a new bucket, delete the file from the old S3 bucket, and then delete the message from the queue. If the server dies during processing, the message goes back onto the queue once the visibility timeout expires. A way to ensure the file is processed and deleted from the old bucket is to enable versioning and a lifecycle policy. When processing the message, if the file doesn't exist in the old bucket, send an alert and/or check for the previous version. You can also have a lifecycle policy on the old bucket to permanently delete versions that are older than X days.
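As a rough sketch of the Lambda side, the handler below reads the bucket name and object key from the S3 event and forwards them to SQS. The `QUEUE_URL` environment variable, the JSON message format, and the helper name `build_messages` are assumptions made for illustration.

```python
import json
import os
from urllib.parse import unquote_plus


def build_messages(event):
    """Turn an S3 event notification into a list of JSON message bodies."""
    messages = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        messages.append(json.dumps({"bucket": bucket, "key": key}))
    return messages


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    sqs = boto3.client("sqs")
    queue_url = os.environ["QUEUE_URL"]  # hypothetical configuration value
    for body in build_messages(event):
        sqs.send_message(QueueUrl=queue_url, MessageBody=body)
```

Keeping the event parsing in its own function makes it possible to check that part locally without AWS credentials.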

Monitoring S3 with Lambda

http://docs.aws.amazon.com/lambda/latest/dg/with-s3.html

http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html

S3 Versioning

http://docs.aws.amazon.com/AmazonS3/latest/dev/Versioning.html

Select Permanently delete previous versions and then enter the number of days after an object becomes a previous version to permanently delete the object (for example, 455 days). http://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
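The same console setting can also be applied from code. The sketch below assumes `s3_client` is a boto3 S3 client; the rule ID and function names are made up for illustration, and the number of days is whatever "X" you choose.

```python
def noncurrent_expiry_rule(days, prefix=""):
    """Build a lifecycle configuration that deletes old object versions."""
    return {
        "Rules": [
            {
                "ID": f"expire-noncurrent-after-{days}-days",  # illustrative name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix matches every object
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def enable_versioning_with_expiry(s3_client, bucket, days):
    """Turn on versioning, then expire previous versions after `days` days."""
    s3_client.put_bucket_versioning(
        Bucket=bucket, VersioningConfiguration={"Status": "Enabled"}
    )
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=noncurrent_expiry_rule(days)
    )
```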

What you need is a system that will store the file without losing it. There are several alternatives for that.

a) Another server

b) An SQS queue. @strongiz's answer above explains it very well.

c) Even another database.

In each of these cases, you need a flag that indicates whether the file has been processed. When file processing is complete, either:

a) Delete the file, or

b) Change the flag

Since PHP is session oriented, you can't store data there permanently, so you need to connect to another interface. In the case of a database, you can store a file path entry and a flag that indicates whether the file has been processed. A combination of the three might also work.


 