
Multiple worker threads and Database synchronization

I have multiple threads that save files on disk and put that information into the DB.

On the other side of the app I have multiple threads that read this information from the DB and process the mentioned files one at a time, sorted by file_id:

SELECT * FROM files_to_process ORDER BY file_id

What I've come up with is to add a PROCESSING_STATUS column which has four statuses: NEW, PROCESSING, FAILED, SUCCESS.

Every worker is supposed to read ONLY one row from the DB (the lowest file_id with status NEW) and immediately update it to status PROCESSING, so that the other workers won't process the same file.
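For concreteness, a naive version of that claim step might look like the JDBC sketch below (illustrative only; dataSource is assumed to be a configured javax.sql.DataSource and the enclosing method is assumed to declare throws SQLException). The gap between the SELECT and the UPDATE is exactly where two workers can grab the same file:

// Naive claim (sketch): SELECT the next NEW row, then UPDATE it to PROCESSING.
// Assumes java.sql.* imports and a configured javax.sql.DataSource named dataSource.
try (Connection conn = dataSource.getConnection()) {
    long fileId;
    try (Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(
             "SELECT file_id FROM files_to_process " +
             "WHERE processing_status = 'NEW' ORDER BY file_id LIMIT 1")) {
        if (!rs.next()) return;                  // nothing to do
        fileId = rs.getLong("file_id");
    }
    // <-- another worker may have read the same file_id in the meantime
    try (PreparedStatement ps = conn.prepareStatement(
             "UPDATE files_to_process SET processing_status = 'PROCESSING' " +
             "WHERE file_id = ?")) {
        ps.setLong(1, fileId);
        ps.executeUpdate();
    }
}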

But something tells me that I might end up with a race condition.

Will transactions solve this problem?

Unfortunately I can't do the whole operation inside one transaction, since processing a file takes a lot of time and the transaction pool would be exhausted, so I have to use two transactions in the following order:

  1. [In Transaction] Fetch row and update to status PROCESSING
  2. [No Transaction] Process File
  3. [In Transaction] Update final state to SUCCESS or FAILED depending on result

Quite annoyingly, UPDATE does not take a LIMIT in PostgreSQL.

You can do something like this:

UPDATE files_to_process SET processing_status = 'PROCESSING' WHERE file_id = (
    SELECT file_id FROM files_to_process
      WHERE processing_status = 'NEW'
      ORDER BY file_id LIMIT 1 FOR UPDATE SKIP LOCKED
) RETURNING *;

With this formulation, there should be no race conditions. You would run this in a transaction by itself (or under autocommit: just run the statement and it will automatically form its own transaction).
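To make the two-transaction flow from the question concrete, here is a minimal JDBC sketch built around that statement (a sketch only: the DataSource, the processFile helper, and the method names are illustrative and not from the question or answer):

// Assumes java.sql.* / javax.sql.DataSource imports; table and column names
// are from the question, everything else is illustrative.
static final String CLAIM_SQL =
    "UPDATE files_to_process SET processing_status = 'PROCESSING' " +
    "WHERE file_id = (SELECT file_id FROM files_to_process " +
    "                  WHERE processing_status = 'NEW' " +
    "                  ORDER BY file_id LIMIT 1 FOR UPDATE SKIP LOCKED) " +
    "RETURNING file_id";

// Transaction 1: claim one NEW row. Under autocommit the single statement
// forms its own transaction. Returns -1 when there is nothing to claim.
static long claimNext(DataSource ds) throws SQLException {
    try (Connection conn = ds.getConnection();
         Statement st = conn.createStatement();
         ResultSet rs = st.executeQuery(CLAIM_SQL)) {
        return rs.next() ? rs.getLong("file_id") : -1;
    }
}

// Transaction 2: record the final state after processing.
static void finish(DataSource ds, long fileId, boolean ok) throws SQLException {
    try (Connection conn = ds.getConnection();
         PreparedStatement ps = conn.prepareStatement(
             "UPDATE files_to_process SET processing_status = ? WHERE file_id = ?")) {
        ps.setString(1, ok ? "SUCCESS" : "FAILED");
        ps.setLong(2, fileId);
        ps.executeUpdate();
    }
}

// Worker loop: no transaction is held open while the file is processed.
static void workerLoop(DataSource ds) throws SQLException {
    long fileId;
    while ((fileId = claimNext(ds)) != -1) {
        boolean ok = processFile(fileId);   // hypothetical long-running work
        finish(ds, fileId, ok);
    }
}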

But rather than using just 'PROCESSING', I would probably make it 'PROCESSING by machine worker7 PID 19345' or something like that. Otherwise, how will you know that processing failed if it fails in an unclean way? (That is the nice thing about doing it in one transaction: failures roll back by themselves.)
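A sketch of building such a marker in Java (hostname via java.net.InetAddress plus the Java 9+ ProcessHandle PID; whether you embed it in processing_status or keep it in a separate column is up to you, and these names are illustrative):

// Build a "who claimed this row" marker from hostname and PID.
String host;
try {
    host = InetAddress.getLocalHost().getHostName();
} catch (UnknownHostException e) {
    host = "unknown-host";
}
String claimedBy = "PROCESSING by machine " + host + " PID " + ProcessHandle.current().pid();
// e.g. bind claimedBy in the claiming UPDATE instead of the bare 'PROCESSING' literal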

"Unfortunately I can't do the whole operation inside one transaction since processing a file takes a lot of time and the transaction pool would be exhausted"

But you should never have more outstanding transactions than you have CPUs available to do the work. Unless you have a very large compute farm, you should be able to make the pool large enough. The big problem with the single-transaction approach, though, is that you have no visibility into what is happening while a file is being processed.

For the two-transaction approach, you will probably want to create a partial index for performance:

CREATE INDEX ON files_to_process (file_id) WHERE processing_status = 'NEW';

Otherwise you will have to dig through all of the completed ones with low file_id to find the next NEW one, and eventually that will get slow. You might also need to VACUUM the table more aggressively than the default.

Try a mutex; a simplistic example:

// mutex is a shared java.util.concurrent.Semaphore with one permit,
// e.g. declared as: private final Semaphore mutex = new Semaphore(1);
try {
  mutex.acquire();                      // block until we hold the permit
  try {
    // access and update the record to PROCESSING
  } finally {
    mutex.release();                    // always return the permit
  }
} catch (InterruptedException ie) {
  Thread.currentThread().interrupt();   // restore the interrupt status
}

Depending on your code you may lock it in various ways; see: Is there a Mutex in Java?
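For instance, one alternative is java.util.concurrent.locks.ReentrantLock (a sketch; the claimNextRecord name is a placeholder, and like the Semaphore above it only coordinates threads within a single JVM):

// ReentrantLock variant of the same critical section
private final ReentrantLock lock = new ReentrantLock();

void claimNextRecord() {
  lock.lock();              // blocks until this thread owns the lock
  try {
    // access and update the record to PROCESSING
  } finally {
    lock.unlock();          // always release, even if the update throws
  }
}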

EDIT:

Sorry, I thought this was a C++ question; this is the Java version.
