
Solr DIH delta import

We plan to use MySQL as the RDBMS for our web app, and also send the data to Solr to support faster search. I am seeking advice on which approach is recommended, and why:

  1. Set up a cron job for periodic updates (say, every 30 minutes). Most tutorials seem to suggest this.
  2. Or, during each HTTP POST, send the data to both MySQL and Solr.

The app will have user-posted comments and various range attributes that require full-text search and facets.

Edit: For those who find their way to this topic, the Solr wiki has a brief write-up on this at https://wiki.apache.org/solr/SolrPerformanceFactors

If you need to serve users near-real-time data, then you should go with the second approach. It updates the Solr index on every write and makes the data available for users to search right away.

If you don't need near-real-time search, then you should go with the first approach, which updates the index every 30 minutes.

But remember that the two approaches might require somewhat different configuration in your Solr setup.
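For the cron-based approach, the periodic job can be as small as one HTTP request to the DataImportHandler. Here is a minimal sketch of building that request URL. It assumes a Solr version that still bundles DIH, a handler registered at the conventional `/dataimport` path, and a delta query already defined in the DIH data config. The host and the core name `comments` are hypothetical.

```python
from urllib.parse import urlencode


def delta_import_url(base_url: str, core: str, commit: bool = True) -> str:
    """Build the DataImportHandler URL that a cron job could request.

    Assumes the DIH request handler is registered at /dataimport for the
    core; the actual path depends on solrconfig.xml.
    """
    params = {
        "command": "delta-import",  # import only rows changed since the last run
        "clean": "false",           # keep the documents already in the index
        "commit": "true" if commit else "false",
    }
    return f"{base_url.rstrip('/')}/{core}/dataimport?{urlencode(params)}"


# Hypothetical host and core name.
print(delta_import_url("http://localhost:8983/solr", "comments"))
```

A crontab entry such as `*/30 * * * * curl -s "<that URL>"` would then run the delta import every 30 minutes.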

As younghobbit suggested, more insight into the kind of application data would make this easier to answer.

Even so, I will lay out some points known to me, and then you might be able to make a slightly more informed choice.

  1. A Solr index is not like a transaction-processing database. It is designed to be efficient for raw text searches, and internally it does some nice stuff to help with search speed (I am not an expert on Solr internals, so Solr experts, please feel free to elaborate on the 'nice stuff'). Indexing data for search is not cheap, so it is best to let Solr do its indexing magic every X minutes rather than all the time. After all, you want it to spend most of its available resources on returning the most relevant search results.
  2. You can send data to Solr as often as you want, but it only becomes searchable after a commit. You can commit after each operation or let Solr auto-commit every X minutes (the interval is configurable; I can't recall the exact setting, but I think it is 15 minutes or so). A commit is what really triggers the resource-hungry part of indexing, so too many commits are not good. On the other hand, too few commits lead to an outdated index.
  3. Since you have a MySQL database, I am guessing some records get updated as well. As of Solr 4.x, Solr does not actually update documents in place. It marks the old document as deleted and simply creates a new one, so each update makes Solr use incrementally more disk space. You can occasionally call the "optimize" operation, and Solr will remove the 'deleted' docs. Optimize is also resource hungry and best done when the server is less busy, and it temporarily needs more disk space while it runs (rule of thumb: index size * 2). Imagine a MySQL record that gets updated 10 times in a span of 30 minutes: if you send data to Solr on each HTTP POST, that leads to 9 deleted documents and 1 active document in Solr, whereas a 30-minute cron job would post 1 or at most 2 versions of the record.
  4. Solr is not exactly transactional. It has commit and rollback operations, but they apply to all documents added since the last commit (I suggest reading the Solr documentation on this). This differs from your HTTP requests, where a commit or rollback on the MySQL database is typically scoped to a single request. For example, suppose you send data to Solr on each HTTP POST and hit a scenario that requires a rollback. MySQL will do a clean rollback, but a Solr rollback is not feasible, because it could also roll back other changes made while the current request was being processed.
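Points 2 and 3 can be sketched together: collapse repeated writes to the same record before sending, and ask Solr to commit within a time window instead of committing on every request. This is only an illustration. `coalesce` and `build_update_request` are hypothetical helper names, and the host, core name, and 60-second window are assumptions. The `commitWithin` parameter on the `/update` handler is a standard Solr request parameter.

```python
import json
from urllib.parse import urlencode


def coalesce(updates):
    """Keep only the most recent version of each document, keyed by 'id'.

    `updates` is an iterable of dicts in the order they were written.
    """
    latest = {}
    for doc in updates:
        latest[doc["id"]] = doc  # a later write replaces an earlier one
    return list(latest.values())


def build_update_request(base_url, core, docs, commit_within_ms=60_000):
    """Return (url, body) for one JSON update request to Solr.

    commitWithin asks Solr to make the documents searchable within the
    given number of milliseconds, rather than forcing a commit per request.
    The body would be POSTed with Content-Type: application/json.
    """
    query = urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url.rstrip('/')}/{core}/update?{query}"
    return url, json.dumps(docs)


# One record edited 10 times within a 30-minute window (hypothetical data).
edits = [{"id": "42", "comment": f"revision {n}"} for n in range(1, 11)]

batch = coalesce(edits)
url, body = build_update_request("http://localhost:8983/solr", "comments", batch)
print(len(edits), "writes ->", len(batch), "document sent")
print(url)
```

Sending all 10 edits individually would leave 9 deleted documents behind in the index. The coalesced batch sends only the final revision.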

Personally, I think approach 1 is better, but you may want to tune the cron frequency to get a near-real-time search response. Truly real-time search can only be achieved with approach 2, but then you have to consider how you handle updates and transactions in connection with Solr. Please get a good understanding of Solr's commit, rollback, and optimize operations before choosing either option.
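If approach 2 is chosen, one way to sidestep the rollback problem is to write to the database first and hand the document to Solr only after the database commit succeeds. Below is a minimal sketch of that ordering, not a definitive implementation. It uses `sqlite3` as a stand-in for MySQL, and a plain list (`solr_queue`, a hypothetical name) in place of the real Solr call.

```python
import sqlite3


def save_comment(conn, solr_queue, comment_id, text):
    """Write to the database first; queue for Solr only if that commits.

    `conn` is a DB-API connection (sqlite3 here as a stand-in for MySQL,
    whose drivers use %s placeholders instead of ?).
    `solr_queue` is a hypothetical list of documents still to be indexed.
    """
    try:
        conn.execute(
            "INSERT INTO comments (id, body) VALUES (?, ?)", (comment_id, text)
        )
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # nothing was queued, so Solr needs no rollback
        return False
    solr_queue.append({"id": comment_id, "body": text})
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id TEXT PRIMARY KEY, body TEXT)")
queue = []

ok_first = save_comment(conn, queue, "1", "first comment")
ok_dup = save_comment(conn, queue, "1", "duplicate id, rejected by the database")
print(ok_first, ok_dup, len(queue))
```

Because the failed insert never reaches the queue, no Solr rollback is needed for it. The remaining risk is a crash between the database commit and the Solr send, which a periodic delta import can repair.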
