
Indexing properties files

Original post 2009-06-23 10:06:49 3 4 java/ indexing/ lucene

I need to index a large number of Java properties and manifest files.

The data in the files is just key-value pairs.

I am thinking of using Lucene for this.

However, I do not need any real full-text search capabilities, as the data is quite structured. I only need to search for exact matches of property values, and the property key is always known. There is no need for tokenizing, and there is also no "default" field. The number of unique property keys could be quite large.

I should also add that I hope to be able to hold the index entirely in memory (in Lucene that would be a RAMDirectory).

So, is Lucene (as primarily a full-text search engine) still a good match, or does something else fit better?

Update: A simple HashMap will not do, because I want to find the files that define property A as value B. It would need to be at least a nested HashMap to hold the triples (Key, Value, Filename).
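For illustration, the nested structure described in the update could look roughly like the sketch below. The class name `PropertyIndex` and its methods are hypothetical, made up for this example; it assumes exact-match lookups only and that everything fits in memory.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not from the original post): key -> value -> set of filenames.
class PropertyIndex {
    private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

    // Record that `filename` defines property `key` with exactly `value`.
    void add(String key, String value, String filename) {
        index.computeIfAbsent(key, k -> new HashMap<>())
             .computeIfAbsent(value, v -> new HashSet<>())
             .add(filename);
    }

    // Files that define `key` as exactly `value`; an empty set if there are none.
    Set<String> filesWith(String key, String value) {
        return index.getOrDefault(key, Collections.emptyMap())
                    .getOrDefault(value, Collections.emptySet());
    }
}
```

After calling `add("A", "B", "some.properties")` for every parsed entry, `filesWith("A", "B")` answers "which files define property A as value B" with two hash lookups.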

4 answers

Yes, a Lucene index with a non-tokenized field per key will do the trick. It's also a bit of overkill; some sort of Map structure will probably be enough for what you are describing.

The main benefit of using Lucene here would be that it abstracts away the details into a fairly simple API.

I would start with a simple HashMap, and if you run into memory problems then move to something more complicated like Lucene. You'd be surprised how efficient a HashMap can be.

If you want to start really simple, just use the Properties object itself: it's an instance of Hashtable (see HashMap vs Hashtable). You can easily use load(InputStream) to load multiple properties files into a single object, and then if you decide to try HashMap, switch to it using new HashMap(propertiesObject).
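A minimal sketch of that approach, using only `java.util.Properties` and `HashMap` from the standard library. The class `PropertiesLoading` and its helper methods are hypothetical names for this example, and the in-memory streams stand in for real files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical helper class for illustration only.
class PropertiesLoading {
    // Load several streams into one Properties object.
    // Keys in later streams overwrite the same keys from earlier streams.
    static Properties loadAll(InputStream... streams) {
        Properties props = new Properties();
        for (InputStream in : streams) {
            try {
                props.load(in);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return props;
    }

    // Copy the entries into a HashMap, as in new HashMap(propertiesObject).
    static Map<Object, Object> toHashMap(Properties props) {
        return new HashMap<>(props);
    }

    // Stand-in for opening a file: wraps text in an InputStream.
    static InputStream streamOf(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.ISO_8859_1));
    }
}
```

Note that merging everything into one Properties object keeps only the last value per key and does not remember which file an entry came from, so it would not by itself answer "which files define A as B".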

If you don't need full-text searching, and only want to represent a large key-value map, then I suggest that Lucene is inappropriate.

I'd suggest something like EhCache, which allows you to hold a large chunk of the data in RAM, but can swap out to a disk file if it gets too large.

Take a look at jdbm: it is a lightweight, open source object database with a fast B+Tree implementation that should work for you. If you don't need high reliability, you can turn off the log part of the database (this makes inserts much faster, at the risk of corrupting the database if you have a power failure in the middle of a write).

We've been using jdbm in several production projects for 4 or 5 years now with some really, really big data sets.

If you can hold the entire index in memory, though, you'd probably be better off using a TreeMap (or multiple TreeMaps if you also need to do reverse indexing), and just serialize it if you need to save to disk.
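One way the TreeMap idea might be sketched is shown below, with a forward map, a reverse map, and plain Java serialization. Everything here is an assumption for illustration: the class name `TreeMapIndex`, the composite "key NUL value" entry format (which assumes keys never contain a NUL character), and the byte-array round trip standing in for a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Collections;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory index built from two TreeMaps.
class TreeMapIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    // Forward index: "key\0value" -> files containing that entry.
    private final TreeMap<String, TreeSet<String>> filesByEntry = new TreeMap<>();
    // Reverse index: file -> entries ("key\0value") defined in it.
    private final TreeMap<String, TreeSet<String>> entriesByFile = new TreeMap<>();

    // Assumption: NUL never appears in a key, so it is a safe separator.
    private static String entry(String key, String value) {
        return key + '\u0000' + value;
    }

    void add(String key, String value, String file) {
        String e = entry(key, value);
        filesByEntry.computeIfAbsent(e, k -> new TreeSet<>()).add(file);
        entriesByFile.computeIfAbsent(file, k -> new TreeSet<>()).add(e);
    }

    // Files that define `key` as exactly `value`.
    SortedSet<String> filesWith(String key, String value) {
        return view(filesByEntry.get(entry(key, value)));
    }

    // Reverse lookup: all entries recorded for one file.
    SortedSet<String> entriesIn(String file) {
        return view(entriesByFile.get(file));
    }

    private static SortedSet<String> view(TreeSet<String> set) {
        if (set == null) {
            return Collections.emptySortedSet();
        }
        return Collections.unmodifiableSortedSet(set);
    }

    // Serialize the whole index; writing these bytes to a file would persist it.
    byte[] toBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(this);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buffer.toByteArray();
    }

    static TreeMapIndex fromBytes(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TreeMapIndex) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A TreeMap keeps its keys sorted, so unlike a HashMap it could also support prefix or range scans over entries if that ever becomes useful; for pure exact-match lookups a HashMap would serve equally well.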