
Possible issues in indexing large documents in Solr/Lucene

Original post: 2014-08-31 08:30:54. Tags: solr, lucene, buffer, batch-processing

I am trying to index a large amount of data in Solr/Lucene. Because it is a legacy system, and for some other reasons, I have to do it via a C++ layer. Before doing that I wanted to optimize the process, so I searched Google and found the following:

Indexing in batches: this helps when indexing fails partway through, because I can restart from the remaining batches.
Buffer lookup
Indexer concurrency
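
The restart-from-remaining-batches idea in the first item can be sketched as below. This is only a minimal illustration: `chunked`, `index_in_batches`, and the `send_batch` callable are hypothetical names, and `send_batch` stands in for whatever the C++ layer uses to post documents to Solr. It is written in Python for brevity, since the logic does not depend on the language. It assumes the documents are produced in the same order on every run, so that a batch number is a valid checkpoint.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def chunked(docs: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most `size` documents."""
    batch: List[dict] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # final, possibly shorter, batch
        yield batch


def index_in_batches(
    docs: Iterable[dict],
    send_batch: Callable[[List[dict]], None],  # hypothetical: posts one batch
    batch_size: int = 500,
    start_batch: int = 0,
) -> Tuple[int, Optional[Exception]]:
    """Send batches in order, skipping the first `start_batch` batches.

    Returns (next_batch, error). `next_batch` is the number of the first
    batch that has not been sent successfully; pass it back as
    `start_batch` to resume. `error` is None if every batch went through.
    """
    next_batch = start_batch
    for number, batch in enumerate(chunked(docs, batch_size)):
        if number < start_batch:
            continue  # already indexed in an earlier run
        try:
            send_batch(batch)
        except Exception as exc:
            return next_batch, exc  # stop at the first failure
        next_batch = number + 1
    return next_batch, None
```

After a failure the caller would store `next_batch` somewhere durable (a file, for example) and call `index_in_batches` again with `start_batch` set to that value.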

I came across the last two terms while looking into other issues, but I do not fully understand them.

Can anyone help me understand these two terms, and any other issues that may arise?

1 answer

I'm not sure what you mean by "Buffer Lookup". Usually this refers to giving the server a decent in-memory cache, so that as many queries as possible can be answered without recalculating, for each query, the intersection between documents and which documents belong to a given set. For Solr this is configured using the different *Cache settings (such as filterCache, queryResultCache and documentCache in solrconfig.xml). The requirements will differ between applications, depending on query load, field definitions, etc. Performing a commit (making documents visible in the index) usually expires the caches, as the cached results might no longer be valid.
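
To make the interaction between caching and commits concrete, here is a toy model. It is not how Solr implements its caches; `CachedSearcher` and all of its methods are made up for illustration. It only shows the behaviour described above: repeated queries are served from memory, and a commit throws the cached results away because they may be stale.

```python
class CachedSearcher:
    """Toy index with a per-term result cache that is dropped on commit."""

    def __init__(self):
        self._docs = []      # (doc_id, text) pairs visible to searches
        self._pending = []   # added but not yet committed
        self._cache = {}     # term -> list of matching doc ids
        self.misses = 0      # how many searches had to be computed

    def add(self, doc_id, text):
        # Added documents stay invisible until commit() is called.
        self._pending.append((doc_id, text))

    def commit(self):
        # Make pending documents visible and expire every cached result.
        self._docs.extend(self._pending)
        self._pending.clear()
        self._cache.clear()

    def search(self, term):
        if term not in self._cache:
            self.misses += 1
            self._cache[term] = [
                doc_id for doc_id, text in self._docs if term in text.split()
            ]
        return list(self._cache[term])  # copy so callers cannot alter the cache
```

One consequence for bulk indexing: committing after every small batch would empty the caches each time, so commit frequency is a trade-off between how quickly documents become visible and how well the caches stay warm.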

Indexer concurrency allows a server to insert documents into the actual index from many threads at the same time, without locking between the threads. Lucene made concurrent indexing possible back in 2011 (for Lucene 4.0), which allows faster and more efficient updates of the index. Whether this matters depends on your application.
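
On the client side, taking advantage of this mostly means sending several batches at once instead of one after another. The sketch below assumes the server accepts concurrent update requests as described above; `index_concurrently` and `send_batch` are hypothetical names, with `send_batch` again standing in for the real call that posts one batch. It is written in Python for brevity, and the same pattern applies to a thread pool in a C++ layer.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def index_concurrently(
    batches: Sequence[list],
    send_batch: Callable[[list], None],  # hypothetical: posts one batch
    workers: int = 4,
) -> List[int]:
    """Send batches from `workers` threads.

    Returns the positions (in `batches`) of the batches that failed,
    so that only those need to be retried.
    """

    def attempt(item):
        position, batch = item
        try:
            send_batch(batch)
            return None
        except Exception:
            return position

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order, whatever the completion order
        results = list(pool.map(attempt, enumerate(batches)))
    return [position for position in results if position is not None]
```

Because batches may complete in any order, a single "last successful batch" checkpoint is no longer enough; tracking the failed positions, as done here, is one simple alternative.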