What is the best way to organise a log parser function?

I've been writing a log parser to pull some information out of some logs and then use it elsewhere. The idea is to run it over a series of log files and store the useful information in a database for future use. The language I'm using is Python (3.8).

The kinds of information extracted from the logs are JSON-style strings (which I store in dictionaries), plain alphanumeric strings, timestamps (which are converted to datetime objects), and integers and floats, the numbers sometimes appearing as values inside those dictionary-type structures.
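For illustration, one parsed message might end up looking something like this (the field names are hypothetical, since the actual log format isn't shown):

```python
from datetime import datetime

message = {
    "timestamp": datetime(2024, 1, 15, 9, 30, 52),   # from the header line
    "source": "worker-3",                            # plain alphanumeric string
    "payload": {"user_id": 42, "latency_ms": 12.5},  # decoded JSON-style body
}
```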

I've written a parse_logs(filepath) function that takes a file path and returns a list of dictionaries containing all the messages found in the log. A message can consist of several of the above types, and to parse the logs I've written a number of helper functions that first isolate each message from the log lines into a list of strings and then manipulate those lists to extract the various kinds of information.

This has resulted in a main parse_logs(filepath: str) -> list function with multiple helper functions (such as extract_datetime_from_header(header_line: str) -> datetime, extract_message(messages: list) -> list and process_message(message: list) -> dict) that each do one specific thing but aren't useful to any other part of the project, since they exist purely to support this function. The only additional thing I want to do (right now, at least) is take those messages and save their information in a database.
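A minimal structural sketch of that layout follows. The signatures are the ones named above; the bodies are guesses, since the actual log format isn't shown (the leading-"[" message marker and the timestamp format are assumptions):

```python
from datetime import datetime
from typing import List

def extract_datetime_from_header(header_line: str) -> datetime:
    # Assumption: the header line starts with an ISO-like timestamp.
    return datetime.strptime(header_line[:19], "%Y-%m-%d %H:%M:%S")

def extract_message(messages: list) -> list:
    # Group raw log lines into one list of lines per message.
    grouped: List[List[str]] = []
    for line in messages:
        if line.startswith("["):  # assumed message-start marker
            grouped.append([])
        if grouped:
            grouped[-1].append(line.rstrip("\n"))
    return grouped

def process_message(message: list) -> dict:
    # Turn one message's lines into a dictionary of typed fields.
    return {"timestamp": extract_datetime_from_header(message[0]),
            "body": message[1:]}

def parse_logs(filepath: str) -> list:
    with open(filepath) as f:
        return [process_message(m) for m in extract_message(f.readlines())]
```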

So, there are two main ways I'm thinking of organising the code:

  1. Making a LogParser class that has the path to the log and a message list as attributes, and all of the functions as methods. (In that case, what should the indentation level of the helpers be? Should they be methods in their own right, or just functions defined inside the method they support?) A sketch of this option follows below.

  2. Having a single base function with all the helper functions nested inside it (since I assume I wouldn't want them importable as standalone functions), running it with only the path as an argument, and having it return the message list to a caller function that takes the list, parses it and moves each message into its place in the database.

Another thing I'm considering is whether to use dataclasses instead of dictionaries for the data. The speed difference won't matter much, since this is a script that will run only a few times a day as a cron job, and it won't matter whether it takes 5 or 20 seconds (unless the difference is far larger; I've only tested it on log samples of half a MB rather than the expected 4-6 GB).

My final concern is keeping the message objects in memory and feeding them directly to the database writer. I've done a bit of testing and estimating, and 150 MB seems like a reasonable ceiling for the worst case (a log consisting only of useful data that is 40% larger than the largest log we currently have), so even if we scale to three times that amount, a 16 GB RAM machine should be able to handle it without any trouble.
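Here is a minimal sketch of the first option, reusing the same hypothetical helpers as above. A leading underscore is the usual Python convention for marking methods as internal, which avoids having to nest them inside parse():

```python
from typing import Dict, List

class LogParser:
    # Option 1: the log path and the message list become attributes,
    # and the helpers become internal methods.
    def __init__(self, filepath: str):
        self.filepath = filepath
        self.messages: List[Dict] = []

    def parse(self) -> List[Dict]:
        with open(self.filepath) as f:
            lines = f.readlines()
        self.messages = [self._process_message(m)
                         for m in self._extract_messages(lines)]
        return self.messages

    def _extract_messages(self, lines: List[str]) -> List[List[str]]:
        raise NotImplementedError  # same logic as the standalone helper

    def _process_message(self, message: List[str]) -> Dict:
        raise NotImplementedError
```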

So, with all that said, I'd like to ask about best practices for organising the code, namely:

  1. Is the class/OOP way a better practice than just writing functions that do the work? Is it more readable/maintainable?
  2. Should I use dataclasses or stick to dictionaries? What are the advantages and disadvantages of each? Which is more maintainable and which is more efficient?
  3. If I care about handling the data from the database rather than from these objects (dicts or dataclasses), which is the more efficient way to go?
  4. Is it alright to keep the message objects in memory until the database transaction is complete, or should I handle this in a different manner? I've thought of a few options: doing a single transaction after I finish parsing a log (but I was told this scales badly, since the temporary list of messages keeps growing in memory right up until it's used in the transaction, and a single large transaction can itself be slow); writing every message to an intermediate file on disk as it's parsed (as a dictionary object) and then feeding that file to the function that handles the database transactions in batches (I was told that's not good practice either); or writing directly to the database while parsing, either after every message or in small batches, so that the pending message list never grows too large (see the batching sketch after this list). I've even thought of going the producer/consumer route, keeping a shared variable that the producer (log parser) appends to while the consumer (database writer) consumes from it, until the log is fully parsed. But that's not something I've done before (except a few times for interview questions, where it was rather simplistic and felt hard to debug and maintain), so I don't feel confident doing it right now. What are the best practices regarding the above?
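A minimal sketch of the small-batches option, using the standard-library sqlite3 module purely as a stand-in for whatever database is actually in use; the table schema, the message fields and the batch size are all assumptions:

```python
import sqlite3
from typing import Dict, Iterable

BATCH_SIZE = 1000  # hypothetical value; tune against real logs

def write_messages(db_path: str, messages: Iterable[Dict]) -> None:
    # Insert parsed messages in fixed-size batches: each executemany()
    # call is committed as one small transaction, so neither the pending
    # Python list nor a single database transaction grows without bound.
    conn = sqlite3.connect(db_path)
    sql = "INSERT INTO messages (ts, body) VALUES (?, ?)"
    try:
        batch = []
        # `messages` may be a generator, so the whole log never has to
        # sit in memory at once.
        for msg in messages:
            batch.append((msg["timestamp"].isoformat(),
                          "\n".join(msg["body"])))
            if len(batch) >= BATCH_SIZE:
                conn.executemany(sql, batch)
                conn.commit()
                batch.clear()
        if batch:  # flush the final partial batch
            conn.executemany(sql, batch)
            conn.commit()
    finally:
        conn.close()
```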

Thank you very much for your time; I know I've asked quite a lot, but I felt like writing down all the thoughts I had and reading some people's opinions on them. In the meantime I'm going to try implementing all of the above ideas (except perhaps the producer/consumer one) and see which feels the most maintainable, human-readable and intuitively correct to me.

  1. Is the class/OOP way a better practice than just writing functions that do the work? Is it more readable/maintainable?

I don't think there's necessarily a best approach. I've seen the following work equally well:

  1. OOP: You'd have a Parser class which uses instance variables to share the parsing state. The parser can be made thread-safe, or not.

  2. Closures: You'd use nested functions to create closures over the input and the parsing state (see the sketch after this list).

  3. Functional: You'd pass the input and parsing state to functions which yield back the new parsing state (e.g. AST + updated cursor index).
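For example, a compact sketch of the closure style, with an assumed message-start marker and a stand-in for the real per-message processing:

```python
def parse_logs(filepath: str) -> list:
    # Closure style: the helpers live inside parse_logs, so they share
    # `messages` through the enclosing scope and aren't importable on
    # their own.
    messages: list = []

    def flush(buffer: list) -> None:
        if buffer:
            messages.append({"lines": buffer.copy()})  # stand-in processing
            buffer.clear()

    buffer: list = []
    with open(filepath) as f:
        for line in f:
            if line.startswith("["):  # assumed message-start marker
                flush(buffer)
            buffer.append(line.rstrip("\n"))
    flush(buffer)  # flush the last message
    return messages
```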

  2. Should I use dataclasses or stick to dictionaries? What are the advantages and disadvantages of each? Which is more maintainable and which is more efficient?

ASTs are usually represented in one of two ways (homogeneous vs heterogeneous):

  1. Homogeneous: you'd have a single ASTNode { type, children } class to represent all the node types.

  2. Heterogeneous: you'd have a concrete node class per type.

Your approach is somewhat a mix of both: as a key/value store, dictionaries can be a little more expressive for pointing to other nodes than list indexes are, but all nodes are still represented by the same underlying type. I usually favor #2 with custom classes, as those self-document the structure of the tree, although in a dynamically typed language there are probably fewer benefits.
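Translated to the question's domain, the heterogeneous/dataclass option might look like the sketch below (the field names are hypothetical); a dictionary carries the same data, but nothing documents its shape:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict

@dataclass
class Message:
    # The record's shape is written down once; a mistyped field name
    # fails loudly instead of silently creating a new dictionary key.
    timestamp: datetime
    source: str
    payload: Dict[str, Any] = field(default_factory=dict)

msg = Message(timestamp=datetime.now(), source="worker-3")
msg_as_dict = {"timestamp": datetime.now(), "source": "worker-3", "payload": {}}
```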

As for performance, I don't know Python well enough to say, but some quick Googling suggests that dictionaries are the most performant option overall.

  3. If I care about handling the data from the database rather than from these objects (dicts or dataclasses), which is the more efficient way to go?

If in-memory AST consumers are uninteresting and you won't have many AST processing operations, then it's probably a bit less important to invest much time and effort into the AST representation; although if you only have a few kinds of nodes, making them explicit from the start shouldn't be a huge effort.

  4. Is it alright to keep the message objects in memory until the database transaction is complete...

Honestly, when you're talking about runtime and memory optimizations, it really depends, and I'd say avoid getting trapped in premature optimization. How big are those logs likely to be? Are memory overflows likely? Is the operation so time-consuming that crashing and having to start over would be unacceptable?

These are all questions that will help you determine which is the most appropriate approach.
