Hadoop for processing data from Apache Solr

I have to process a huge amount of data, and I would like it to be processed using distributed computing (scalable). I am fetching the data from Apache Solr: on passing a particular input I get a huge dataset back. For each record in this dataset I will pass the primary key to a REST API to obtain some information, which will be attached to the record. Then each record will undergo some update. Each updated object in the final huge collection will be written as a separate XML file into a folder.

Is Hadoop applicable in this particular scenario? I have seen the wordcount sample in the Hadoop MapReduce documentation. I tried to think of my situation in a similar way, in which the map emitted by MapReduce for 2 nodes would be

Node1 - Map<InputIdToSolr1,Set<RecordsFromSolr1to500>>
Node2 - Map<InputIdToSolr1,Set<RecordsFromSolr500to1000>>

Then these results will be combined by the reduce function in Hadoop. Unlike wordcount, my nodes will have only one element in the map for each node. I am not sure if using Hadoop makes sense. What other options/open source Java projects can I use to scale the processing of the records? I have seen Terracotta from Spring, but it seems to be a commercial application.

I don't know the scale of scalability you are looking for, but I would first try a multithreaded solution on a multicore box.

If the performance does not match expectations, and you have the flexibility of getting more hardware and instances of your application, you may start thinking of a Map-Reduce solution.
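For illustration only, here is a minimal sketch of such a multithreaded first attempt, assuming hypothetical helpers fetchRecordsFromSolr, callRestApi, updateRecord and writeXmlFile that stand in for the application-specific steps the question describes:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelRecordProcessor {

    // Placeholder for whatever record type comes back from Solr.
    static class Record {
        String id;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Record> records = fetchRecordsFromSolr("someInputId"); // hypothetical Solr fetch

        // One worker per core; each record is an independent task.
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        for (Record record : records) {
            pool.submit(() -> {
                String extraInfo = callRestApi(record.id); // hypothetical REST lookup by primary key
                updateRecord(record, extraInfo);           // hypothetical in-place update
                writeXmlFile(record);                      // hypothetical write of one XML file per record
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    // Stubs standing in for the application's real Solr/REST/XML code.
    static List<Record> fetchRecordsFromSolr(String inputId) { return List.of(); }
    static String callRestApi(String primaryKey) { return ""; }
    static void updateRecord(Record r, String extraInfo) { }
    static void writeXmlFile(Record r) { }
}

Since every record is processed independently, the same per-record task also maps naturally onto a Hadoop mapper if one box stops being enough, as the later answer describes.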

Terracotta is not from Spring/SpringSource/VMware, although it is proprietary and commercial.

Have you considered using NoSQL databases? The decision of which one to use really depends on the shape of your data. To check them out (all open source):

More about NoSQL databases.

Edit:
I've just stumbled upon this webinar from Couchbase and Cloudera (a Hadoop solutions and support company), where they're going to discuss NoSQL + Hadoop usage.

The task sounds well suited to Hadoop's MapReduce. More than that, Lucene and Hadoop were created by the same person, Doug Cutting. In your case you can consider different levels of integration. The simplest one would be to put your datasets into HDFS, then select or write an input format suited to your data format, and in the Mapper make your REST call to complete the record (see the sketch below).
If you have a lot of different but relatively simple processing, I would suggest representing your data as Hive tables, either over HDFS or over SOLR.
I am not an expert on the SOLR architecture, but if you are using Apache Nutch together with SOLR, you might already have Hadoop integrated and may be able to use it.
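For illustration only (not part of the original answer), a minimal map-only sketch of that approach, assuming the Solr records are staged in HDFS one per line and using hypothetical helpers extractPrimaryKey, callRestApi and buildUpdatedXml for the application-specific parts; no reducer is needed because the records are independent:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Each input line is assumed to hold one Solr record (its primary key plus
 * serialized fields). The mapper enriches the record via the REST call and
 * emits the updated record as XML; the job can run with zero reducers.
 */
public class EnrichRecordMapper
        extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String primaryKey = extractPrimaryKey(line.toString());          // hypothetical parsing
        String extraInfo = callRestApi(primaryKey);                      // hypothetical REST lookup
        String updatedXml = buildUpdatedXml(line.toString(), extraInfo); // hypothetical update + XML
        context.write(NullWritable.get(), new Text(updatedXml));
    }

    // Stubs for the application-specific pieces mentioned in the question.
    private String extractPrimaryKey(String record) { return record; }
    private String callRestApi(String primaryKey) { return ""; }
    private String buildUpdatedXml(String record, String extraInfo) { return record; }
}

The number of concurrent map tasks then grows with the number of input splits, which gives the horizontal scaling the question asks about.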
