
DSpace dual write to RDBMS and Solr vs. concurrency

I'd like to know how DSpace manages indexing in both the database and Solr while supporting concurrency. In other words, if two individuals try to write to the same item at the same time (e.g. changing metadata), how does DSpace ensure that the index will not become desynchronized from the database?

This can happen if USER1 and USER2 write concurrently to the same metadata value: USER1's write to the database happens first, then USER2's writes to both the database and the index happen, and finally USER1's write to the index happens.

In other words, USER1's write ends up in the index while USER2's write ends up in the database: an inconsistency!

I wonder how this case, which is a typical dual-write problem, is avoided in DSpace.

Given DSpace's Event system, I don't see how it can be avoided.

Does anyone know?


In Solr, DSpace doesn't index just the single metadata change (when it occurs). It actually reindexes the entire Item in Solr.

What this means is that while concurrency is an issue in the Database layer (and writes/updates are synchronized in the database), it is not one in the Solr indexing process.

Here's what would/should happen in your example.

  1. User 1 and User 2 edit the same Item's title at the same time.
  2. The edits will be synchronized at the database level, so that the first one in happens first. Let's suppose User 1's edit happens first, then User 2's edit.
  3. User 1's edit will trigger a reindex in Solr. So will User 2's edit. This means this same item will be reindexed twice (once for each edit). These reindexes are not tied to the specific update (so the first reindex doesn't only index the title), but just tell Solr that the item was updated and it needs reindexing.
  4. On the first reindex, User 1's edit will have been made, so the Item will be indexed with that title.
  5. By the time of the second reindex, User 2's edit will have been made (as a reindex inherently takes longer than saving an edit to the DB layer), so the Item will be (re-)indexed with that updated title.

So, the simple answer here is that DSpace doesn't reindex individual modifications (which could end up out of order if not synchronized with the DB edits). Instead, it tracks which objects have been updated and triggers a reindex of the entire object's metadata. While this may seem like "overkill", the reindex of a single object in Solr is not all that process intensive, and it ensures that the object's current/latest metadata is indexed in Solr (in the case of simultaneous writes).
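To make the difference concrete, here is a minimal sketch (not DSpace code) that replays the interleaving from the question in a fixed order. The two maps stand in for the database and the Solr index, and every name in it (`ReindexSketch`, `indexDelta`, `reindexFromDatabase`) is hypothetical. It only illustrates why copying the current database state into the index is insensitive to the order in which the index writes arrive.

```java
import java.util.HashMap;
import java.util.Map;

class ReindexSketch {
    // Hypothetical stand-ins for the database row and the Solr document of one item.
    static final Map<String, String> database = new HashMap<>();
    static final Map<String, String> index = new HashMap<>();

    // Delta-style indexing: push the value carried by the edit itself.
    static void indexDelta(String itemId, String editedTitle) {
        index.put(itemId, editedTitle);
    }

    // Whole-object reindexing: ignore the edit, copy whatever the database holds now.
    static void reindexFromDatabase(String itemId) {
        index.put(itemId, database.get(itemId));
    }

    // Interleaving from the question: DB(user 1), DB(user 2), index(user 2), index(user 1).
    static String runDeltaIndexing() {
        database.clear();
        index.clear();
        database.put("item-1", "title from user 1");
        database.put("item-1", "title from user 2");
        indexDelta("item-1", "title from user 2");
        indexDelta("item-1", "title from user 1"); // late index write overwrites the newer value
        return index.get("item-1");
    }

    // Same interleaving, but each index step re-reads the database.
    static String runWholeObjectReindex() {
        database.clear();
        index.clear();
        database.put("item-1", "title from user 1");
        database.put("item-1", "title from user 2");
        reindexFromDatabase("item-1"); // triggered by user 2's edit
        reindexFromDatabase("item-1"); // triggered late by user 1's edit, still reads the latest row
        return index.get("item-1");
    }

    public static void main(String[] args) {
        System.out.println("delta indexing leaves index at:   " + runDeltaIndexing());
        System.out.println("whole-object reindex leaves it at: " + runWholeObjectReindex());
    }
}
```

In the delta variant the index ends up holding User 1's title while the database holds User 2's. In the whole-object variant both reindexes read the same latest row, so the order in which they run does not matter.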

UPDATE: As requested (in comments below), here's how DSpace performs reindexing (in Solr) in much more detail.

  1. DSpace has a defined Event system. It follows configurations in the dspace.cfg in this section: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/dspace.cfg#L732
  2. By default, in DSpace 5, the IndexEventConsumer is what performs indexing for Solr. It is configured by default here: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace/config/dspace.cfg#L732
  3. In DSpace 5, whenever someone changes an Item's metadata (or anything about an Item), the Item.update() method is called to actually save the changes back to the database layer.
  4. After saving all changes (using DatabaseManager.update()), the Item.update() method generates a new MODIFY event in the Event System.
  5. That new MODIFY event is appended to the LinkedList of events in the current Context.
  6. When the Context is committed (which happens when saving a change), it first calls commit on the DB connection (ending the transaction). Then, it sends the list of events to the Dispatcher (BasicDispatcher is configured by default in dspace.cfg), which in turn triggers the indexing in Solr (via the configured IndexEventConsumer).
  7. The IndexEventConsumer passes the list of updated objects (in this case an Item) to the IndexingService (SolrServiceImpl by default).
  8. Finally, SolrServiceImpl.indexContent() reads the latest metadata value(s) from the Database and indexes them in Solr.
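The steps above can be sketched roughly as follows. This is a toy model, not the DSpace API: `ToyContext`, `updateItemTitle`, and `consume` are hypothetical names, the maps replace the real database and Solr, and the actual classes (Context, BasicDispatcher, IndexEventConsumer, SolrServiceImpl) do considerably more. The point is only the ordering: events are queued per context, the database commit comes first, and the consumer reads the committed row rather than the value carried by the edit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

class EventFlowSketch {
    // Hypothetical stand-ins for committed database rows and Solr documents.
    static final Map<String, String> committedDb = new HashMap<>();
    static final Map<String, String> solrIndex = new HashMap<>();

    // Toy per-user context: holds uncommitted writes plus a queue of MODIFY events.
    static class ToyContext {
        private final Map<String, String> pendingWrites = new HashMap<>();
        private final LinkedList<String> modifyEvents = new LinkedList<>();

        // Analogue of steps 3-5: stage the change and queue a MODIFY event for the item.
        void updateItemTitle(String itemId, String title) {
            pendingWrites.put(itemId, title);
            modifyEvents.add(itemId); // the event names the object, not the new value
        }

        // Analogue of step 6: commit the database work first, then dispatch the events.
        void commit() {
            committedDb.putAll(pendingWrites);
            pendingWrites.clear();
            List<String> toDispatch = new ArrayList<>(modifyEvents);
            modifyEvents.clear();
            for (String itemId : toDispatch) {
                consume(itemId);
            }
        }
    }

    // Analogue of steps 7-8: reindex the whole object from the committed database state.
    static void consume(String itemId) {
        solrIndex.put(itemId, committedDb.get(itemId));
    }

    // Two users edit the same item in separate contexts, then commit one after the other.
    static String simulate() {
        committedDb.clear();
        solrIndex.clear();
        ToyContext user1 = new ToyContext();
        ToyContext user2 = new ToyContext();
        user1.updateItemTitle("item-1", "title from user 1");
        user2.updateItemTitle("item-1", "title from user 2");
        user1.commit();
        user2.commit();
        return solrIndex.get("item-1");
    }

    public static void main(String[] args) {
        System.out.println("index: " + simulate());
        System.out.println("db:    " + committedDb.get("item-1"));
    }
}
```

Because the queued event carries only the item identifier, the consumer has nothing stale to write: whatever it indexes is whatever the database held when it ran.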

The above logic is still a bit simplified (as it'd be way too complex to walk through every step of the code). But the basic gist here is that each Item.update() call is treated as a database transaction. It also triggers the addition of a MODIFY event, which is stored in the user's session (Context object). As soon as the DB transaction is committed, the MODIFY event is processed by the IndexEventConsumer, which reindexes the entire Item.

So, in the case of simultaneous edits, two MODIFY events will be generated (one for each edit). However, the last MODIFY event will not be triggered until after the last database edit is committed. Therefore, the Solr index should always be in sync with the latest info in the Database.
