
Indexing an external REST API with Solr: is it possible?

This question is maybe a weird one, but my employer has asked me to find out and thus I will.

In our application we use an external REST API to search for some data. This REST API can deliver many types of data, for example city names and street names, but it is only possible to look up one type of data at a time. In our app we force the users to choose what data type to look for as they search, but now our users don't want to do this. So if they search for, say, 'los', they want the result to contain both 'Los Angeles' and 'Losing Street'. For this to be possible for us right now, we would have to do two separate searches in the REST API and merge the results.
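To make the "two searches and merge" workaround concrete, here is a minimal sketch. The `search` callable and `fake_search` are hypothetical stand-ins for the external REST API, whose real interface is not described here; the sample data only mirrors the example above.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical signature: (data_type, query) -> matching names.
SearchFn = Callable[[str, str], Iterable[str]]


def search_all_types(search: SearchFn, query: str,
                     data_types: List[str]) -> List[Dict[str, str]]:
    """Run one lookup per data type and merge the hits into one list."""
    merged = []
    for data_type in data_types:
        for name in search(data_type, query):
            merged.append({"type": data_type, "name": name})
    # Sort case-insensitively by name so the types are interleaved.
    merged.sort(key=lambda hit: hit["name"].lower())
    return merged


def fake_search(data_type: str, query: str) -> List[str]:
    """Stand-in for the REST API: prefix match within a single type."""
    data = {
        "city": ["Los Angeles", "Boston"],
        "street": ["Losing Street", "Main Street"],
    }
    return [n for n in data[data_type] if n.lower().startswith(query.lower())]


results = search_all_types(fake_search, "los", ["city", "street"])
print(results)
```

The cost of this approach is one REST call per data type for every keystroke or search, which is what indexing everything in one place would avoid.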

So instead my employer has read about Solr and is adamant that it is possible to index the REST API so that we use Solr to search for what we want in one search request. I am not so sure. Is it possible, and is it feasible?

Yes, it is definitely possible to come up with a solution for the requirement described above. Solr is a full-text search engine, and the fields you declare as indexed in its schema become searchable. You can carry out different kinds of operations on these fields through combinations of analyzers and tokenizers. You can map all the searchable fields to one specific field using copy fields (for example, city name and street name -> name text) and run your search against this one field to get the desired result.
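As a rough illustration of the copy-field idea, the sketch below builds a JSON payload of `add-field` and `add-copy-field` commands in the shape Solr's Schema API accepts, as far as I understand it. The field names (`city_name`, `street_name`, `name_text`) are assumptions made for this example, and `string` and `text_general` are field types from Solr's default configset that may not exist in your setup.

```python
import json
from typing import Dict, List


def build_schema_commands(source_fields: List[str], dest_field: str) -> Dict:
    """Build Schema API commands that copy several fields into one."""
    fields = [
        {"name": name, "type": "string", "indexed": True, "stored": True}
        for name in source_fields
    ]
    # The catch-all field receives values from several sources, so it is
    # declared multiValued; it is searched but does not need to be stored.
    fields.append({
        "name": dest_field,
        "type": "text_general",
        "indexed": True,
        "stored": False,
        "multiValued": True,
    })
    copy_rules = [{"source": name, "dest": dest_field} for name in source_fields]
    return {"add-field": fields, "add-copy-field": copy_rules}


# Hypothetical field names for the city/street example.
commands = build_schema_commands(["city_name", "street_name"], "name_text")
payload = json.dumps(commands, indent=2)
print(payload)
```

The payload would be POSTed to the collection's `/schema` endpoint. Check the Schema API documentation for your Solr version before relying on this exact shape.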

Solr is a RESTful search engine, and it serves data in XML and, optionally, JSON format. It is a really useful platform for operating over huge amounts of data, but it doesn't help much with the analytics part, such as calculations.

A few of the benefits include auto-suggest, highlighting, facets, synonym search, n-gram search, auto-correct, etc.

I think you should send a feature request to the REST API maintainer to support a composite search.

The only other thing you can do is download the whole database from the REST API and create your own database, which you can then index and search with your custom queries, and which you have to keep in sync with the REST API. I don't think you want to do that. It will work, but so-called REST APIs usually don't decouple clients from the implementation of the service with links and semantic annotations, so I am afraid it will break easily with any change to the API.

AFAIK, Solr is a storage solution that supports full-text search and has a REST interface.

Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results.
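A minimal sketch of what "index via JSON over HTTP, query via HTTP GET" can look like from Python's standard library. The base URL, the collection name `places`, and the field `name_text` are assumptions for illustration. The code only builds the requests, so it can be inspected without a running Solr.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local Solr instance and collection name.
SOLR_BASE = "http://localhost:8983/solr/places"


def build_update_request(docs: List[Dict]) -> Request:
    """Build a POST that sends a JSON array of documents to /update."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR_BASE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_select_url(user_input: str, rows: int = 10) -> str:
    """Build a GET URL for a prefix query on the assumed name_text field.

    Assumes user_input is a single alphanumeric term; real input would
    need Solr query-syntax escaping.
    """
    term = user_input.strip().lower()
    params = {"q": f"name_text:{term}*", "rows": rows, "wt": "json"}
    return f"{SOLR_BASE}/select?{urlencode(params)}"


update_request = build_update_request([
    {"id": "city-1", "city_name": "Los Angeles"},
    {"id": "street-1", "street_name": "Losing Street"},
])
select_url = build_select_url("Los")
print(update_request.get_method(), update_request.full_url)
print(select_url)
```

Passing `update_request` or `select_url` to `urllib.request.urlopen` would perform the actual calls. A single query on the combined field returns both cities and streets.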

You should have no trouble posting the data from the REST API to Solr using the Data Import Handler (DIH), Solr's RESTful interface, or something like Spring Data Solr once you actually have the data. The tricky part is how you will "crawl" the third-party REST API data.

Depending on whether the REST API provider gives you any way to paginate through the data, e.g. chronologically or alphabetically, you may be able to write a program outside of Solr that polls the REST API and then stores the data in a local database before posting it to Solr. This will be easier if the REST API provider allows you to retrieve only records created or changed after a certain time, so that your polling is efficient and retrieves only a small amount of data after the initial full indexing. Some REST providers also offer webhooks to notify your application that data in their API has been updated. Whether this approach is feasible depends on the amount of data and on whether you can limit it (by user account, etc.) to only what you need.
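The polling approach can be sketched as follows, assuming the provider supports both an "updated after" filter and offset-based pagination. The `fetch_page` callable, the record shape, and the `places` table are hypothetical. A real client would replace `fake_fetch_page` with HTTP calls to the third-party API.

```python
import sqlite3
from typing import Callable, Dict, List

# Hypothetical signature: (updated_after, offset, limit) -> list of records.
FetchPage = Callable[[str, int, int], List[Dict[str, str]]]


def poll_once(conn: sqlite3.Connection, fetch_page: FetchPage,
              page_size: int = 100) -> int:
    """Fetch records changed since the last poll and upsert them locally."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS places "
        "(id TEXT PRIMARY KEY, type TEXT, name TEXT, updated_at TEXT)"
    )
    # ISO-8601 UTC timestamps compare correctly as plain strings.
    last_seen = conn.execute("SELECT MAX(updated_at) FROM places").fetchone()[0] or ""
    offset = 0
    stored = 0
    while True:
        page = fetch_page(last_seen, offset, page_size)
        if not page:
            break
        conn.executemany(
            "INSERT OR REPLACE INTO places VALUES (:id, :type, :name, :updated_at)",
            page,
        )
        stored += len(page)
        offset += len(page)
    conn.commit()
    return stored


# Made-up sample data standing in for the remote API.
RECORDS = [
    {"id": "c1", "type": "city", "name": "Los Angeles",
     "updated_at": "2024-01-01T00:00:00Z"},
    {"id": "s1", "type": "street", "name": "Losing Street",
     "updated_at": "2024-01-02T00:00:00Z"},
]


def fake_fetch_page(updated_after: str, offset: int, limit: int) -> List[Dict[str, str]]:
    changed = [r for r in RECORDS if r["updated_at"] > updated_after]
    return changed[offset:offset + limit]


conn = sqlite3.connect(":memory:")
first_run = poll_once(conn, fake_fetch_page, page_size=1)
second_run = poll_once(conn, fake_fetch_page, page_size=1)
print(first_run, second_run)
```

The first run acts as the full crawl and later runs only pick up changes. Note that the strict "greater than" comparison would miss records sharing the exact timestamp of the last stored one, so a real implementation may need to overlap the window slightly.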

It's important to store the third party data in a local database outside of Solr, since Solr's index data files are volatile and sometimes need to be deleted after making configuration changes. That way, you can write a process to repost the data from your database to Solr without having to crawl the REST API again.
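A small sketch of the repost step: read everything from the local database and yield batches of Solr-style documents, so the index can be rebuilt without touching the REST API. The table layout and document field names are assumptions for this example, and the sample rows are made up.

```python
import sqlite3
from typing import Dict, Iterator, List


def iter_solr_batches(conn: sqlite3.Connection,
                      batch_size: int = 500) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of documents read from the local 'places' table."""
    cursor = conn.execute("SELECT id, type, name FROM places ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [{"id": r[0], "type": r[1], "name": r[2]} for r in rows]


# Self-contained demo with an in-memory database and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (id TEXT PRIMARY KEY, type TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO places VALUES (?, ?, ?)",
    [
        ("c1", "city", "Los Angeles"),
        ("c2", "city", "Boston"),
        ("s1", "street", "Losing Street"),
    ],
)
conn.commit()

batches = list(iter_solr_batches(conn, batch_size=2))
for batch in batches:
    # Each batch would be sent as a JSON array to Solr's /update endpoint.
    print(len(batch), [doc["id"] for doc in batch])
```

Batching keeps each update request to a manageable size. The right batch size depends on document size and on your Solr setup.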

For handling the polling at regular intervals, you could use something like Apache Camel or Spring Integration along with Quartz Scheduler. Both of those support REST endpoints. You can also take a look at the DIH examples that come with Solr.
