
Is Lucene a good choice for a key/value HashMap?

I am facing a problem. I am writing a mini web crawler, and right now it is important to have an efficient HashMap. I just want a key/value data structure with only inserts and lookups.

I know Lucene can do the job just by having two fields, key and value; but is it efficient? Are there any simpler solutions?

PS: It can be in PHP or Java, but I would prefer PHP.

Note: I need it to be persisted, and it will be opened and closed several times.

If all you want is a fast, persistent key/value store for a non-enormous dataset, Lucene probably isn't the best solution; Berkeley DB would be the obvious choice. That said, Grant Ingersoll gave a talk at this year's Lucene Revolution conference about exactly this. He intentionally came at the question with a pro-Lucene bias, and got into a back-and-forth with several audience members about what contemporary document databases (like CouchDB) provide that Lucene doesn't. For any non-huge dataset that might eventually need secondary indexes, I think this is a great solution. Lucene's performance for key/value lookups won't be quite as fast as Berkeley DB, CouchDB, Tokyo Tyrant, or the like, but it's still quite speedy, and more than adequate for many apps. I think he measured roughly 50ms for a key/value lookup on a recent laptop. And if later on you need to add secondary indexes (as it seems you might on the results of a web crawl), you'll have a much easier time with Lucene than with those products.
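To make the pattern concrete, here's a minimal sketch of a key/value store on Lucene's Java API (the LuceneKV class and field names are my own illustration, and exact method signatures vary a little between Lucene versions):

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneKV {
    private final Directory dir;

    public LuceneKV(String path) throws Exception {
        dir = FSDirectory.open(Paths.get(path)); // on-disk index, survives restarts
    }

    // Insert or overwrite a key/value pair. A real crawler would keep one
    // IndexWriter open and commit periodically instead of opening one per put.
    public void put(String key, String value) throws Exception {
        try (IndexWriter writer =
                 new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("key", key, Field.Store.YES)); // indexed as-is, not tokenized
            doc.add(new StoredField("value", value));              // stored only, not searchable
            writer.updateDocument(new Term("key", key), doc);      // replaces any older doc
        }
    }

    // Exact-match lookup by key; returns null if the key is absent.
    public String get(String key) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("key", key)), 1);
            if (hits.scoreDocs.length == 0) return null;
            return searcher.doc(hits.scoreDocs[0].doc).get("value");
        }
    }
}
```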

Other tools, like BDB, will be simpler to code for than Lucene. But if that's a concern, just use Solr, which makes it easy to add docs and search via simple HTTP calls (you'll want to modify the fields in the schema.xml config file, but otherwise Solr should be ready to use out of the box).
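For illustration, adding and fetching a doc over Solr's HTTP interface might look like this (Java 11+ HttpClient; the core name "kv" and the field "value_s" are assumptions you'd adapt to your schema.xml):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrKVExample {
    // Assumes a local Solr with a core named "kv"; "id" is Solr's usual unique
    // key, and "value_s" a string field defined (or dynamic) in schema.xml.
    static final String BASE = "http://localhost:8983/solr/kv";
    static final HttpClient http = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Add (or overwrite) a document; commit=true makes it visible right away.
        HttpRequest add = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/update?commit=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "[{\"id\":\"page1\",\"value_s\":\"crawled content here\"}]"))
                .build();
        System.out.println(http.send(add, HttpResponse.BodyHandlers.ofString()).body());

        // Exact lookup by key.
        HttpRequest get = HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/select?q=id:page1"))
                .build();
        System.out.println(http.send(get, HttpResponse.BodyHandlers.ofString()).body());
    }
}
```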

Now, if your dataset is too big to reasonably fit on one machine, distributed key/value stores, like Project Voldemort or Riak, might be easier to set up and administer. But Lucene will get you pretty far on one machine, especially if you aren't indexing many fields: at least a TB, I'd guess.

If you do use Lucene, I'd think hard about whether there truly aren't any properties other than the key you'd like to search by; you might as well get them stored the first time, since Lucene makes it easy.

I've (ab)used Solr as a key/value store on a couple of occasions with tens of millions of records. Also, we have an index in production that includes a full copy of the indexed data in JSON format, and we run queries that return this value so that we can avoid a redundant and much slower database lookup.

So, depending on your needs, it is quite an OK solution, but you need to be aware of the limitations.

Pros.

1) If you are already using solr or lucene, it is convenient to not have to use another technology.

2) Lucene is pretty good at lookups of single rows and should scale well for that purpose.

3) With a few extra columns you gain querying capability as well.

Cons.

1) Lucene is not designed as a transactional store. Typically you add multiple rows and then commit them, so writes are not atomic in the ACID sense. That's usually a bad thing if you are storing important data. (Near) real-time indexing is possible these days, but it still requires a lot of fiddling to get right (see the sketch after this list).

2) Because there is a delay between when you add and when you commit, reading your own writes may be problematic.

3) If you need a lot of write throughput, it is best to index in bulk. If you write individual keys one by one, your throughput will suffer.

4) While Lucene excels at querying, large result sets are problematic. For example, a query that returns all the keys of your values can get very expensive on a Solr index with tens of millions of rows.
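Regarding cons 1 and 2, near-real-time readers opened directly from the IndexWriter let you see writes before a commit. A minimal sketch, assuming Lucene 8.x or later (ByteBuffersDirectory is used here just to keep the example self-contained):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;

public class NrtExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(),
                new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("key", "page1", Field.Store.YES));
        writer.addDocument(doc); // added but NOT committed yet

        // A near-real-time reader opened from the writer sees the pending doc.
        try (DirectoryReader nrt = DirectoryReader.open(writer)) {
            IndexSearcher searcher = new IndexSearcher(nrt);
            int hits = searcher.count(new TermQuery(new Term("key", "page1")));
            System.out.println(hits); // prints 1, even before commit()
        }

        writer.commit(); // make the write durable
        writer.close();
    }
}
```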

You could look at document-oriented databases, such as CouchDB or MongoDB.

You might want to look into Solr, which is a best-practice implementation of Lucene. It has a REST-based interface, it is pretty straightforward to set up, and there is a PHP client you can use.

Lucene is the wrong tool for the job you describe.

The simplest solution is a HashMap, and it's fairly efficient. Is there any particular reason you think a HashMap would be a bad solution?
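Since the question notes the map must be persisted across runs, a plain HashMap would need load/save wrapped around it. A minimal sketch (this PersistentMap class is my own illustration; it is not crash-safe, and the whole map must fit in memory):

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

public class PersistentMap {
    private final File file;
    private final HashMap<String, String> map;

    @SuppressWarnings("unchecked")
    public PersistentMap(File file) throws IOException, ClassNotFoundException {
        this.file = file;
        if (file.exists()) {
            // Reload the map saved by a previous run.
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                map = (HashMap<String, String>) in.readObject();
            }
        } else {
            map = new HashMap<>();
        }
    }

    public void put(String key, String value) { map.put(key, value); }

    public String get(String key) { return map.get(key); }

    // Write the whole map back to disk; call before shutting down.
    public void close() throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(map);
        }
    }
}
```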

If you need to scale out to a cluster, I'd switch over to Memcached.
