
How to perform an Elasticsearch lookup in StreamSets

Original post: 2021-03-02 08:54:55 2 3 linux/ docker/ elasticsearch/ logstash/ streamsets

I am accepting two kinds of records, A and B, in StreamSets v3.21. A field called correlationid is common to the parent type A and its multiple child type B records. Type A always arrives first. Type A and type B are written to separate Elasticsearch indices on the same cluster from the same pipeline. The sending and composition of type A and type B are not within my control. They are pre-processed by Logstash 7.81 by a filter group to which I can add new files, but whose existing files I cannot alter.

There is a field X on type A that I need to put in the type B records that get written to Elasticsearch. Does anyone know a way of making Elasticsearch update the type B records when they arrive by looking up type A?

Alternatively, can anyone tell me a way of looking up the type A record in Elasticsearch (from StreamSets) before the type B records are written, and applying value X to the type B records?

Alternatively, I've considered using an environment variable named after the correlationid with value X so that I can look it up, but I'm concerned about blowing the heap: I can never know when to remove the env var, as there can be up to N type B records.

Alternatively, maybe Logstash could cache the value of correlationid and X somehow. There is a filter called "environment" which would allow me to store env vars for type A and apply them to type B, but I can find no way to clear it down periodically.

3 answers

How about using the Jython evaluator and the 'state' object? You can (carefully) use the state object as a cache and just add a field to a record before sending it to Elasticsearch.
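A minimal sketch of that caching idea, written as plain Python so it can be run outside StreamSets. The record layout is an assumption: each record is a dict with a hypothetical `type` field ('A' or 'B') alongside `correlationid` and `X`. The cache key `x_by_correlationid` and the `max_entries` bound are also made up for illustration. The bound addresses the heap concern from the question by evicting the oldest correlation IDs first.

```python
from collections import OrderedDict

MAX_ENTRIES = 10000  # hypothetical cap; tune to the available heap


def enrich(records, state, max_entries=MAX_ENTRIES):
    """Copy X from type A records onto later type B records.

    `state` is any dict-like object that survives between batches.
    `records` is a list of dicts; the 'type' field is an assumed
    discriminator, not something the original question specifies.
    """
    cache = state.setdefault('x_by_correlationid', OrderedDict())
    out = []
    for value in records:
        cid = value.get('correlationid')
        if value.get('type') == 'A':
            # Remember X for the children that arrive later.
            cache[cid] = value.get('X')
            # Evict the oldest entries so the cache cannot grow unbounded.
            while len(cache) > max_entries:
                cache.popitem(last=False)
        elif cid in cache:
            # Type B: apply the cached parent value if we still have it.
            value['X'] = cache[cid]
        out.append(value)
    return out
```

Inside the Jython evaluator the loop would presumably run over the stage's own bindings (in recent 3.x releases something like `sdc.records`, `record.value`, `sdc.state` and `sdc.output.write(record)`; check the stage documentation for your version). A type B record whose parent has already been evicted passes through without X, so the cap needs to be large enough for the expected gap between parent and children.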

In the end I just gave up (even the StreamSets consultants didn't know) and wrote it myself in threaded Java using the Elasticsearch API classes.

You can also set up the JDBC driver for Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/master/sql-jdbc.html

Then use the JDBC Lookup stage in your pipeline. The JDBC Lookup stage supports providing a JDBC driver class: https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Processors/JDBCLookup.html

To be 100% honest, I didn't try it with the Elasticsearch driver, but I did try it with other JDBC drivers.

It used to work great.

Another way of doing that is to use the Scala or Groovy evaluator and perform the lookup inside the Scala or Groovy code.
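As a rough illustration of what an in-script lookup could look like, here is a sketch in plain Python 3 (not Groovy or Scala, so it would need porting to whichever evaluator is used). It sends a `term` query to the Elasticsearch `_search` REST endpoint and pulls X out of the first hit. The function names, `base_url` and `index` are hypothetical, and the query assumes `correlationid` is mapped as a keyword; with default dynamic mapping the field to query may be `correlationid.keyword` instead.

```python
import json
from urllib import request


def build_search_body(correlation_id, field="X"):
    """Build a _search body that fetches only `field` from one matching doc."""
    return {
        "size": 1,
        "_source": [field],
        "query": {"term": {"correlationid": correlation_id}},
    }


def extract_field(response, field="X"):
    """Return `field` from the first hit of a _search response, or None."""
    hits = response.get("hits", {}).get("hits", [])
    if not hits:
        return None
    return hits[0].get("_source", {}).get(field)


def lookup_x(base_url, index, correlation_id, field="X", timeout=5):
    """POST the query to <base_url>/<index>/_search and return the field value."""
    body = json.dumps(build_search_body(correlation_id, field)).encode("utf-8")
    req = request.Request(
        "%s/%s/_search" % (base_url.rstrip("/"), index),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return extract_field(json.loads(resp.read().decode("utf-8")), field)
```

One thing to watch with this approach: Elasticsearch search is near-real-time, so a type A document indexed moments earlier may not be visible to the query until the index has refreshed. A type B record that arrives right behind its parent could therefore miss the lookup unless the script retries or falls back to a cache.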