Using LMAX Disruptor (3.0) in Java to process millions of documents
I have the following use case:
When my service starts, it may need to deal with millions of documents in as short a burst as possible. There will be three sources of data.
I have set up the following:
// batchSize = 100, bufferSize = 2^30
public MyDisruptor(@NonNull final MyDisruptorConfig config) {
    batchSize = config.getBatchSize();
    bufferSize = config.getBufferSize();
    this.eventHandler = config.getEventHandler();
    ThreadFactory threadFactory = createThreadFactory("disruptor-threads-%d");
    executorService = Executors.newSingleThreadExecutor(threadFactory);
    ringBuffer = RingBuffer.createMultiProducer(new EventFactory(), bufferSize, new YieldingWaitStrategy());
    sequenceBarrier = ringBuffer.newBarrier();
    batchEventProcessor = new BatchEventProcessor<>(ringBuffer, sequenceBarrier, eventHandler);
    ringBuffer.addGatingSequences(batchEventProcessor.getSequence());
    executorService.submit(batchEventProcessor);
}
public void consume(@NonNull final List<Document> documents) {
    List<List<Document>> subLists = Lists.partition(documents, batchSize);
    for (List<Document> subList : subLists) {
        log.info("publishing sublist of size {}", subList.size());
        long high = ringBuffer.next(subList.size());
        long low = high - (subList.size() - 1);
        long position = low;
        for (Document document : subList) {
            ringBuffer.get(position++).setEvent(document);
        }
        ringBuffer.publish(low, high);
        lastPublishedSequence.set(high);
    }
}
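The batch publish above claims all of a sub-list's slots in one call: ringBuffer.next(n) returns the highest claimed sequence, and the lowest is high - (n - 1). A self-contained sketch of that arithmetic (the helper is hypothetical, not part of the Disruptor API), showing the [low, high] range each sub-list would claim:

```java
import java.util.ArrayList;
import java.util.List;

public class ClaimRanges {
    // Hypothetical helper: mirrors the low/high arithmetic in consume() above,
    // computing the sequence range each batch of `batchSize` documents would claim.
    static List<long[]> claimRanges(int documentCount, int batchSize, long startSequence) {
        List<long[]> ranges = new ArrayList<>();
        long nextSequence = startSequence;
        for (int from = 0; from < documentCount; from += batchSize) {
            int size = Math.min(batchSize, documentCount - from);
            long high = nextSequence + size - 1; // what ringBuffer.next(size) would return
            long low = high - (size - 1);        // same formula as in consume()
            ranges.add(new long[] { low, high });
            nextSequence = high + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 250 documents with batchSize 100 claim [0,99], [100,199], [200,249]
        for (long[] r : claimRanges(250, 100, 0)) {
            System.out.println(r[0] + ".." + r[1]);
        }
    }
}
```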
Each of my sources calls consume; I use Guice to create a singleton disruptor.
My eventHandler routine is:
public void onEvent(Event event, long sequence, boolean endOfBatch) throws Exception {
    Document document = event.getValue();
    handler.processDocument(document); // send the document to handler
    if (endOfBatch) {
        handler.processDocumentsList(); // tell handler to process all documents so far
    }
}
I am seeing in my logs that the producer (consume) is stalling at times. I assume this happens when the ringBuffer is full and the eventHandler is not able to process quickly enough. I see (from my logs) that the eventHandler is processing documents, and then after a while the producer starts publishing more documents to the ring buffer.
Questions:
I chose to use the BatchEventProcessor so that it signals endOfBatch.
Is your handler stateful? If not, you can use multiple parallel event handlers to process the documents. You could implement a basic sharding strategy where only one of the handlers processes each event.
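A minimal sketch of that sharding idea (the class and field names are hypothetical; a real handler would implement com.lmax.disruptor.EventHandler): give each parallel handler an ordinal, and have it process only the sequences where sequence % numHandlers == ordinal, so every event is handled by exactly one handler:

```java
// Sketch: sequence-modulo sharding across parallel handlers.
// Each handler claims only the sequences assigned to its ordinal,
// so each event is processed by exactly one of the N handlers.
public class ShardedHandler {
    private final int ordinal;      // this handler's shard index (0..numHandlers-1)
    private final int numHandlers;  // total number of parallel handlers
    int processed = 0;              // counter, for illustration only

    public ShardedHandler(int ordinal, int numHandlers) {
        this.ordinal = ordinal;
        this.numHandlers = numHandlers;
    }

    // Same shape as EventHandler.onEvent(event, sequence, endOfBatch)
    public void onEvent(Object document, long sequence, boolean endOfBatch) {
        if (sequence % numHandlers != ordinal) {
            return; // another handler owns this sequence
        }
        processed++;
        // actual document processing would go here
    }
}
```

All handlers still see every event; the modulo check is what keeps the work disjoint, so no coordination between the handlers is needed.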
endOfBatch is usually used to speed up processing by optimising IO operations that benefit from batching, e.g. writing to a file on each event but only flushing on endOfBatch.
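A sketch of that pattern (the handler name is hypothetical): write each document to a buffered stream as its event arrives, but flush only when endOfBatch is true, so the flush cost is paid once per batch rather than once per event:

```java
import java.io.IOException;
import java.io.Writer;

// Sketch: batch-friendly IO in an event handler. Writes go to a buffered
// stream on every event; the expensive flush happens only on endOfBatch.
public class BatchingWriterHandler {
    private final Writer out; // e.g. a BufferedWriter over a file

    public BatchingWriterHandler(Writer out) {
        this.out = out;
    }

    // Same shape as EventHandler.onEvent(event, sequence, endOfBatch)
    public void onEvent(String document, long sequence, boolean endOfBatch) throws IOException {
        out.write(document);
        out.write('\n');
        if (endOfBatch) {
            out.flush(); // one flush per batch instead of one per event
        }
    }
}
```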
It's hard to give any more advice without knowing what happens in your document processor.