
Looking for algorithm that reverses the sprintf() function output

I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take group messages like this:

The temperature at P1 is 35F.

The temperature at P1 is 40F.

The temperature at P3 is 35F.

Logger stopped.

Logger started.

The temperature at P1 is 40F.

and puts out something in the form of a printf():

"The temperature at P%d is %dF.", Int1, Int2" 
{(1,35), (1, 40), (3, 35), (1,40)}

The algorithm needs to be generic enough to recognize almost any data load in message groups.

I tried searching for this kind of technology, but I don't even know the correct terms to search for.

I think you might be overlooking fscanf() and sscanf(), which are the opposite of fprintf() and sprintf().

Overview:

A naïve(!) algorithm keeps track of the frequency of words on a per-column basis, where one can assume that each line can be separated into columns by a delimiter.

Example input:

The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon

Frequencies:

Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}

We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).

The next step is to iterate through the lines which generated these frequency lists, so let's take the first example.

  1. The : meets some hand-wavy criteria and the algorithm decides it must be static.
  2. dog : doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i .
  3. jumped : same deal as #1; it's static, so leave as is.
  4. over : same deal as #1; it's static, so leave as is.
  5. the : same deal as #1; it's static, so leave as is.
  6. moon : same deal as #1; it's static, so leave as is.

Thus, just from going over the first line we can put together the following regular expression:

/The ([a-z]+?) jumped over the moon/
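To make the hand-wavy part concrete, here is a minimal Perl sketch of the two passes, assuming whitespace-delimited columns and using "one word covers every line" as the static test; that test and the fallback pattern ([a-z]+) are illustrative assumptions, not a worked-out scheme:

use strict;
use warnings;

my @lines = (
    "The dog jumped over the moon",
    "The cat jumped over the moon",
    "The moon jumped over the moon",
    "The car jumped over the moon",
);

# Pass 1: count word frequencies per column.
my @freq;
for my $line (@lines) {
    my @cols = split ' ', $line;
    $freq[$_]{ $cols[$_] }++ for 0 .. $#cols;
}

# Pass 2: walk the first line; a column whose top word covers every
# line is treated as static, anything else becomes a capture group.
my @first = split ' ', $lines[0];
my @parts;
for my $i (0 .. $#first) {
    my ($top) = sort { $freq[$i]{$b} <=> $freq[$i]{$a} } keys %{ $freq[$i] };
    push @parts, $freq[$i]{$top} == @lines ? quotemeta($first[$i]) : '([a-z]+)';
}
print join(' ', @parts), "\n";   # prints: The ([a-z]+) jumped over the moon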

Considerations:

  • Obviously one can choose to scan part of the document or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.

  • False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.

  • The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.

Thanks for all the great suggestions. Chris is right. I am looking for a generic solution for normalizing any kind of text. The solution of the problem boils down to dynamically finding patterns in two or more similar strings. Almost like predicting the next element in a set, based on the previous two:

1: Everest is 30000 feet high

2: K2 is 28000 feet high

=> What is the pattern? =>模式是什么? => Answer: =>答案:

[name] is [number] feet high
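A rough Perl sketch of that pairwise comparison, assuming whitespace tokens and lines with equal token counts; the [name]/[number] typing is only illustrative:

use strict;
use warnings;

# Derive a pattern from two similar lines: identical tokens stay as
# static text, differing tokens become typed placeholders.
sub infer_pattern {
    my ($s, $t) = @_;
    my @a = split ' ', $s;
    my @b = split ' ', $t;
    return undef unless @a == @b;   # this sketch only handles equal token counts
    my @out;
    for my $i (0 .. $#a) {
        if ($a[$i] eq $b[$i]) {
            push @out, $a[$i];                 # static text
        } elsif ($a[$i] =~ /^\d+$/ && $b[$i] =~ /^\d+$/) {
            push @out, '[number]';             # both numeric
        } else {
            push @out, '[name]';               # anything else that differs
        }
    }
    return join ' ', @out;
}

print infer_pattern('Everest is 30000 feet high',
                    'K2 is 28000 feet high'), "\n";
# prints: [name] is [number] feet high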

Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.

I thought about creating some high-level semantic hashes to represent the patterns in the message strings. I would use a tokenizer and give each of the token types a specific "weight". Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
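As a hedged sketch of that idea in Perl (the token types here stand in for the "weights" and are assumptions, not a worked-out scheme), each line is reduced to a shape string, and lines whose shapes collide end up in the same group:

use strict;
use warnings;

# Reduce a line to a coarse "semantic hash": every token is replaced
# by a type marker, so lines with the same structure collide.
sub shape {
    my ($line) = @_;
    my @types = map {
        /^\d+(?:\.\d+)?$/     ? 'NUM'  :   # 35, 123.456
        /^\d+[A-Za-z]+[.,]?$/ ? 'UNIT' :   # 35F.
        /^[A-Za-z]+\d+$/      ? 'ID'   :   # P1, K2
                                lc $_      # literal word
    } split ' ', $line;
    return join '|', @types;
}

my %groups;
while (my $line = <DATA>) {
    chomp $line;
    push @{ $groups{ shape($line) } }, $line;
}
for my $key (sort keys %groups) {
    printf "%-40s %d line(s)\n", $key, scalar @{ $groups{$key} };
}

__DATA__
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.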

I was hoping that I didn't have to reinvent the wheel and could reuse something that is already out there.

Klaus

It depends on what you are trying to do. If your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too.

You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.

Producing something like this, even just to recognize numbers, gets really hairy. For example, is "123.456" one number or two? How about "123,456"? Is "35F" a decimal number followed by an 'F', or is it the hex value 0x35F? You're going to have to build something that will parse in the way you need. You can do this with regular expressions, or with sscanf, or some other way, but you're going to have to write something custom.

However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):

my @vals = ();
while (defined(my $line = <>))
{
    # Capture the probe number and the temperature from matching lines.
    if ($line =~ /The temperature at P(\d+) is (\d+)F\./)
    {
        push(@vals, "($1,$2)");
    }
}
# Emit the format string, then the collected tuples, comma-separated.
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < @vals; $i++)
{
    print $vals[$i];
    if ($i < @vals - 1)
    {
        print ",";
    }
}
print "}\n";

The output from this is:

The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}

You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.

I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.

I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked, or that another pattern should be added.

After around an hour of work, I succeeded in finding ~20 patterns to match 10,000+ lines.

In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file, removing any matched line, it leaves "only":

Logger stopped.

Logger started.

Which you can then match with "Logger (.+)."

You can then refine the patterns and find new ones to match your whole log.
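A small Perl sketch of that workflow with the two patterns guessed so far (app.log is a hypothetical file name; extend @patterns as unmatched lines surface):

use strict;
use warnings;

# Patterns guessed so far; refine this list until nothing prints.
my @patterns = (
    qr/^The temperature at P[1-3] is [0-9]{2}F\.$/,
    qr/^Logger .+\.$/,
);

open my $fh, '<', 'app.log' or die "app.log: $!";
while (my $line = <$fh>) {
    chomp $line;
    next if grep { $line =~ $_ } @patterns;    # skip matched lines
    print "UNMATCHED: $line\n";                # needs a new or tweaked pattern
}
close $fh;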

@John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for it. The *scanf family can't do that on its own; it can only be of help once the patterns have been recognised in the first place.

http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree interface for just flipping through pages.

Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.

The key issue of course is the complexity of your GRAMMAR. To use any kind of log parser, as the term is commonly used, you need to know exactly what you're scanning for; you can write a BNF for it. Many years ago I took a course based on Aho and Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.

On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.

@Derek Park: Well, even a strong AI couldn't be sure it had the right answer.

Perhaps some compression-like mechanism could be used:

  1. Find large, frequent substrings (a sketch follows this list).
  2. Find large, frequent substring patterns (i.e. [pattern:1] [junk] [pattern:2]).
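A rough Perl sketch of step 1, counting word-level substrings across lines and keeping the frequent ones; the minimum length and count are arbitrary assumptions:

use strict;
use warnings;

my $MIN_WORDS = 3;   # arbitrary: only substrings of 3+ words
my $MIN_COUNT = 3;   # arbitrary: seen at least 3 times

my %count;
while (my $line = <DATA>) {
    chomp $line;
    my @w = split ' ', $line;
    for my $i (0 .. $#w) {
        for my $j ($i + $MIN_WORDS - 1 .. $#w) {
            $count{ join ' ', @w[$i .. $j] }++;
        }
    }
}
# Report large, frequent substrings, longest first.
for my $s (sort { length($b) <=> length($a) } keys %count) {
    print "$count{$s}x  $s\n" if $count{$s} >= $MIN_COUNT;
}

__DATA__
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.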

Another item to consider might be to group lines by edit distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
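To make the edit-distance idea concrete, here is a greedy grouping sketch in Perl: each line joins the first group whose representative is within a distance threshold (the threshold of 10 is an arbitrary assumption). It uses the CPAN module Text::Levenshtein rather than hand-rolling the dynamic programming:

use strict;
use warnings;
use Text::Levenshtein qw(distance);

my $THRESHOLD = 10;   # arbitrary; tune per log format
my @groups;           # each group: { rep => $line, lines => [...] }

while (my $line = <DATA>) {
    chomp $line;
    my ($hit) = grep { distance($_->{rep}, $line) <= $THRESHOLD } @groups;
    if ($hit) { push @{ $hit->{lines} }, $line }
    else      { push @groups, { rep => $line, lines => [$line] } }
}
printf "group: %-35s (%d lines)\n", $_->{rep}, scalar @{ $_->{lines} }
    for @groups;

__DATA__
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.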

Actually, if you manage to write this, let the whole world know; I think a lot of us would like this tool!

@Anders

Well, even a strong AI couldn't be sure it had the right answer.

I was thinking that a sufficiently strong AI could usually figure out the right answer from the context. E.g., strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).

Of course, it doesn't really matter, since we don't have strong AI. :)
