
Best practice to update a column of all documents in Elasticsearch

I'm developing a log analysis system. The input is log files. I have an external Python program that reads the log files and decides whether each record (line) of a log file is "normal" or "malicious". I want to use the Elasticsearch Update API to append my Python program's result ("normal" or "malicious") to Elasticsearch by adding a new field called result, so that I can see my program's verdicts clearly in the Kibana UI.

Simply speaking, my Python code and Elasticsearch each consume the log files independently. Now I want to push the results from my Python code into Elasticsearch. What's the best way to do it?
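
For reference, a minimal sketch of what such an update looks like with the official elasticsearch Python client (7.x-style call), assuming a hypothetical index name logs and that the document's _id is already known, which is exactly the hard part discussed below:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Partial update: merge a new "result" field into an existing document.
es.update(
    index="logs",                        # hypothetical index name
    id="known-document-id",              # the _id of the document to update
    body={"doc": {"result": "malicious"}},
)
```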

I can think of several ways:

  1. Elasticsearch automatically assigns an ID ( _id ) to each document. If I can find out how Elasticsearch calculates _id , then my Python code can compute it by itself and update the corresponding Elasticsearch document via _id . The problem is that the official Elasticsearch documentation doesn't say what algorithm it uses to generate _id .

  2. Add an ID (like a line number) to the log records myself. Both my program and Elasticsearch will know this ID, and my program can use it for updates (see the sketch after this list). However, the downside is that my program has to search for this ID every time, because it's only a normal field instead of the built-in _id, so the performance will be very bad.

  3. My Python code gets the logs from Elasticsearch instead of reading the log files directly. But this makes the system fragile, as Elasticsearch becomes a single critical point. For now I only want Elasticsearch to be a log viewer.
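
To illustrate option 2: if the ID is deterministic (say, a hash of file path plus line number), it can be used as the document _id at index time, and the later update then needs no search at all. A hedged sketch, with the index name, field layout, and hashing scheme all assumed:

```python
import hashlib
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def record_id(file_path: str, line_no: int) -> str:
    # Deterministic _id that both the indexer and the analyzer can compute.
    return hashlib.sha1(f"{file_path}:{line_no}".encode()).hexdigest()

doc_id = record_id("/var/log/app.log", 42)

# Index the raw record under our own _id instead of an auto-generated one.
es.index(index="logs", id=doc_id, body={"message": "raw log line here"})

# Later, the Python analyzer updates the same record directly by _id.
es.update(index="logs", id=doc_id, body={"doc": {"result": "normal"}})
```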

So the first solution looks ideal from where I stand, but I'm not sure whether there are better ways to do it?

If possible, re-structure your application so that instead of dumping plain text to a log file you're writing structured log information directly to something like Elasticsearch. Thank me later.

That isn't always feasible (e.g. if you don't control the log source). I have a few opinions on your solutions.

  1. This feels super brittle . Elasticsearch does not base _id on the properties of a particular document; it's selected based on the existing _id values it has stored (and, I think, a random seed as well). Even if it could work, relying on an undocumented property is a good way to shoot yourself in the foot when dealing with a team that makes breaking changes even to its documented behavior as often as Elasticsearch does.
  2. This one actually isn't so bad . Elasticsearch supports manually choosing the id of a document. Even if it didn't, it performs quite well on bulk terms queries and wouldn't be as much of a bottleneck as you might think (see the bulk sketch after this list). If you really have so much data that this could break your application, then Elasticsearch might not be the best tool.
  3. This solution is great . It's super extensible and doesn't create a complicated dependence on how the log file is constructed, how you've chosen to index that log in Elasticsearch, and how you're choosing to read it with Python. You just get a document, and if you need to update it then you do that updating.

    Elasticsearch isn't really a worse point of failure here than before (if ES goes down, your app goes down in any of these solutions) – you're just doing twice as many queries (read and write). If a factor of 2 kills your application, you either need a better solution to the problem (i.e. avoid Elasticsearch), or you need to throw more hardware at it. ES supports all kinds of sharding configurations, and you can build a robust cluster on the cheap.
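
To back up the performance point in option 2: with manually chosen ids, the verdicts can be pushed in batches rather than one round trip per record. A sketch using the bulk helper from the elasticsearch Python client, with the index name and ids assumed:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# results: mapping of document _id -> "normal" / "malicious"
results = {"id-1": "normal", "id-2": "malicious"}

actions = (
    {
        "_op_type": "update",   # partial update, not a re-index
        "_index": "logs",
        "_id": doc_id,
        "doc": {"result": verdict},
    }
    for doc_id, verdict in results.items()
)

helpers.bulk(es, actions)  # one HTTP round trip per chunk, not per document
```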

One question though: why do you have logs in Elasticsearch that need to be updated with this particular normal/malicious property after the fact? If you're the one putting them into ES, then just tag them appropriately before you ever store them, which avoids the extra read that's bothering you. If that's not an option, then you'll probably still want to read from ES directly to pull the logs into Python anyway, to avoid the enormous overhead of parsing the original log file again.
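
Tagging before you store is just running the classifier first and indexing the record and verdict together. A minimal sketch, where classify() stands in for your existing normal/malicious logic and the index name and field names are assumptions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

with open("/var/log/app.log") as f:
    for line_no, line in enumerate(f, start=1):
        verdict = classify(line)   # your existing normal/malicious logic
        es.index(
            index="logs",
            body={
                "message": line.rstrip("\n"),
                "line": line_no,
                "result": verdict,  # verdict stored at write time: no later update needed
            },
        )
```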

If this is a one-time hotfix to existing ES data while you're rolling out the normal/malicious tagging, then don't worry about a 2x speed improvement. Just throttle the query if you're concerned about bringing down the cluster. The hotfix will finish eventually, and probably faster than if we keep deliberating about the best option.
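
As an example of that throttling: _update_by_query accepts a requests_per_second limit, so a backfill (here, defaulting every untagged document to "normal", a hypothetical policy for illustration) can be made gentle on the cluster:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Backfill a default verdict on documents that don't have one yet,
# throttled so the hotfix doesn't starve regular traffic.
es.update_by_query(
    index="logs",
    body={
        "script": {"source": "ctx._source.result = 'normal'", "lang": "painless"},
        "query": {"bool": {"must_not": {"exists": {"field": "result"}}}},
    },
    requests_per_second=100,  # the throttle knob discussed above
    conflicts="proceed",
)
```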
