
Log Parsing using awk and sed

I have

2019-11-14T09:42:14.150Z  INFO ActivityEventRecovery-1 ActivityCacheManager - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Handling activity 0082bc26-70a6-433e-a470-
2019-11-14T09:43:08.097Z  INFO L2HostConfigTaskExecutor2 TransportNodeAsyncServiceImpl - FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Calling uplinkTeamingChangeListener.onTransportNodeUpdated on TN 72f73c66-da37-11e9-8d68-005056bce6a5 revision 5
2019-11-14T09:43:08.104Z  INFO L2HostConfigTaskExecutor2 Publisher - ROUTING [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Refresh mac address of Logical router port connected with VLAN LS for logical router LogicalRouter/f672164b-40cf-461f-9c8d-66fe1e7f8c19
2019-11-14T09:43:08.105Z  INFO L2HostConfigTaskExecutor2 GlobalActivityRepository - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Submitted activity 73e7a942-73d2-4967-85fa-7d9d6cc6042b in QUEUED state with dependency null exclusivity true and requestId null and time taken by dao.create is 1 ms

these kinds of logs, and I want to parse them into a JSON object. So far I have been using a Python regex and putting the matches into a dictionary:

    currentDict = {
        "@timestamp" : regexp.group(1),
        "Severity" : regexp.group(2),
        "Thread" : regexp.group(3),
        "Class" : regexp.group(4),
        "Message-id" : regexp.group(5),
        "Component" : regexp.group(6),
        "Message" : regexp.group(7),
        "id's" : re.findall(x[1], regexp.group(7))
    }

but this way is very slow: it takes 5-10 minutes for a 200 MB file.

The Python regex I used - (\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d.\d\d\dZ)\s+(INFO|WARN|DEBUG|ERROR|FATAL|TRACE)\s+(.*?)\s+(.*?)\s+\-\s+(.*?)\s+(?:(\[?.*?\])?)\s(.*)

Expected output -

{"@timestamp" : "2019-11-14T09:42:14.150Z", "Sevirity" : "INFO", "Thread" : "ActivityEventRecovery-1", "Class" : "ActivityCacheManager - -", "Component" : "[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"]", "Message" : "Handling activity 0082bc26-70a6-433e-a470-"}
{"@timestamp" : "2019-11-14T09:43:08.097Z", "Sevirity" : "INFO", "Thread" : "L2HostConfigTaskExecutor2", "Class" : "TransportNodeAsyncServiceImpl - FABRIC", "Component" : "[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"]", "Message" : "Calling uplinkTeamingChangeListener.onTransportNodeUpdated on TN 72f73c66-da37-11e9-8d68-005056bce6a5 revision 5}"}
{"@timestamp" : "2019-11-14T09:43:08.104Z", "Sevirity" : "INFO", Thread : "L2HostConfigTaskExecutor2", "Class" : "Publisher - ROUTING", "Component" : "[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"]", Message : "Refresh mac address of Logical router port connected with VLAN LS for logical router LogicalRouter/f672164b-40cf-461f-9c8d-66fe1e7f8c19}"}
{"@timestamp" : "2019-11-14T09:43:08.105Z", "Sevirity" : "INFO", "Thread" :  "L2HostConfigTaskExecutor2", "Class" :   "GlobalActivityRepository", "Component" : "[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"]", "Messages" : "Submitted activity 73e7a942-73d2-4967-85fa-7d9d6cc6042b in QUEUED state with dependency null exclusivity true and requestId null and time taken by dao.create is 1 ms"}}

On the internet, I found that this can be done faster using awk and sed. I don't know much about them. How can I do this parsing using awk and sed?

Please help!

# For timestamp
cut -d " " -f 1 in > temp
sed -i -e 's/^/{"@timestamp" : "/' temp
awk 'NF{print $0 "\", "}' temp > a

# For Severity ...

# For Thread ...

# For Class
cut -d " " -f 5,6,7 in > temp
sed -i -e 's/^/"Class" : "/' temp
awk 'NF{print $0 "\", "}' temp > d

# For Component
grep -o -P '(?<=\[).*(?=\])' in > temp
sed -i -e 's/^/"Component" : \["/' temp
awk 'NF{print $0 "\"], "}' temp > e

# For Message ...

# Merge all files line by line
paste -d " " a b c d e f

I'll explain some of this script in short: cut is used to get the word between two spaces, sed adds text to the beginning of each line, and awk adds text to the end of each line.
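For illustration, here is what each of those three steps does to the timestamp of the first sample line (echo stands in for reading from the file, the line is truncated at the ..., and sed is shown without -i so the result is printed; the # -> lines show the output):

echo '2019-11-14T09:42:14.150Z  INFO ActivityEventRecovery-1 ...' | cut -d " " -f 1
# -> 2019-11-14T09:42:14.150Z
echo '2019-11-14T09:42:14.150Z' | sed -e 's/^/{"@timestamp" : "/'
# -> {"@timestamp" : "2019-11-14T09:42:14.150Z
echo '{"@timestamp" : "2019-11-14T09:42:14.150Z' | awk 'NF{print $0 "\", "}'
# -> {"@timestamp" : "2019-11-14T09:42:14.150Z",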

I have left out the Severity, Thread and Message sections, as they are the same as the others. The script is fairly simple, but you won't understand it without knowing how to use the tools themselves, since you said you don't know about them.
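As a side note, the multiple cut/sed/awk passes plus paste can also be folded into a single awk pass, which avoids re-reading the file and the temporary files altogether. Below is a rough sketch of that idea (not the script above; it assumes the exact field layout of the sample lines: timestamp, severity, thread, a three-token "Class", a bracketed component containing no ] character, then the message):

#! /bin/sh
# One pass over 'in': build each JSON line directly instead of pasting columns.
awk '{
    ts=$1; sev=$2; thr=$3
    cls=$4 " " $5 " " $6                 # fields 4-6 form the "Class" part
    match($0, /\[[^]]*\]/)               # locate the [component ...] block
    comp=substr($0, RSTART, RLENGTH)
    msg=substr($0, RSTART+RLENGTH+1)     # everything after "] " is the message
    printf "{\"@timestamp\" : \"%s\", \"Severity\" : \"%s\", \"Thread\" : \"%s\", \"Class\" : \"%s\", \"Component\" : \"%s\", \"Message\" : \"%s\"}\n", ts, sev, thr, cls, comp, msg
}' in

As with the expected output in the question, the inner double quotes (e.g. comp="nsx-manager") are not escaped, so this produces JSON-shaped lines rather than strictly valid JSON.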

This sed script should do a very fast job:

sed -E 's/^([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+\s-\s[^ ]+)\s+(\[.*\])\s+(.*)$/{"@timestamp" : "\1", "Severity" : "\2", "Thread" : "\3", "Class" : "\4", "Component" : "\5", "Message" : "\6"}/' inputdata.dat

Explanation:

  • sed 's/^<regular-expression>$/<output-string>/' manipulates/substitutes every input line (^<input-line>$) that matches the regex. Here ^ means beginning of line and $ means end of line.
  • -E : means use extended regular expressions. Extended regular expressions are now included in POSIX. In grep's manual page you can find:

    Basic vs Extended Regular Expressions:

    In basic regular expressions the meta-characters ?, +, {, |, (, and ) lose their special meaning; instead use the backslashed versions \?, \+, \{, \|, \(, and \).

  • (...)...(...)...(...) : the content inside (...) is - in this script - a description matching the first, second, third ... sixth data field. In general, it is an instrument for subdividing the whole regular expression into units that you can refer to in the output string via \1 , \2 ... . Doing so, you can limit text manipulations to certain contexts. In this case the data fields themselves stay unchanged; only the context around them changes.
  • [^ ]+ : within [...] a class of characters is described. [A-Za-z0-9] means exactly one alphabetic or digit character, [0-9]+ means at least one digit, [ ] means one blank, and [^0-9 ] means exactly one character that is neither a digit nor a blank. So here [^ ]+ means at least one character that is anything but a blank - the right regex pattern for the content of the first three data fields.
  • ([^ ]+\s-\s[^ ]+) : the fourth data field "Class" is a compound data field: two components and a separator ' - '. Outside of character classes ( [..] ), use \s instead of ' '.
  • (\[.*\]) : the fifth data field "Component" is also a compound data field, but enclosed in square brackets [ and ] . In order to match a bracket character itself (rather than open a character class), you have to precede [ or ] with a backslash \ . The . is a wildcard, so .* in \[.*\] means everything between the brackets.
  • \s+ : at least one blank or tab (between the data fields).
  • So the sixth data field - the message of flexible length - can be matched as (.*) , meaning the rest after ] and the directly following whitespace, up to the end of the line.
  • \1 ... \6 (in the replacement part): references in the output string to the respective expression groups (in this case, the data fields) within the regular expression.
  • inputdata.dat : replace this with the name of your data file.
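To see the command at work, you can pipe the first sample line through it (echo stands in for the data file here; note that \s outside a bracket expression is a GNU sed extension, so this assumes GNU sed):

echo '2019-11-14T09:42:14.150Z  INFO ActivityEventRecovery-1 ActivityCacheManager - - [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Handling activity 0082bc26-70a6-433e-a470-' | sed -E 's/^([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+\s-\s[^ ]+)\s+(\[.*\])\s+(.*)$/{"@timestamp" : "\1", "Severity" : "\2", "Thread" : "\3", "Class" : "\4", "Component" : "\5", "Message" : "\6"}/'

which prints (as one long line):

{"@timestamp" : "2019-11-14T09:42:14.150Z", "Severity" : "INFO", "Thread" : "ActivityEventRecovery-1", "Class" : "ActivityCacheManager - -", "Component" : "[nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"]", "Message" : "Handling activity 0082bc26-70a6-433e-a470-"}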

In order to get a runnable shell script, save this as a file:

#! /bin/sh
sed -E 's/^([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+([^ ]+\s-\s[^ ]+)\s+(\[.*\])\s+(.*)$/{"@timestamp" : "\1", "Severity" : "\2", "Thread" : "\3", "Class" : "\4", "Component" : "\5", "Message" : "\6"}/' "$1" >"$2"

After that, run chmod +x <your-scriptname> to make the script executable. Then it can be run as ./<your-scriptname> <input-file> <wanted-output-file-name> .
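For example, with hypothetical file names (parse-logs.sh being the saved script):

chmod +x parse-logs.sh
./parse-logs.sh inputdata.dat parsed.json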

Attention:

  • Do NOT use the same filename for the input and output file.
  • If a file with the output filename already exists, it will be overwritten.
