Using regular expressions in sed and awk

Question

I have to use a regex with sed or awk to find things in a log file. The log file looks like this:

Jan 16 08:33:18 mail.knurledwidgets.example.org sendmail[1618]: qhgKT0cN80gSX: to=<user1@company.example.com>, delay=00:00:02, xdelay=00:00:01, mailer=esmtp, pri=193069, relay=mx.company.example.com. [192.168.123.12], dsn=2.0.0, stat=Sent (OK <sp4jffaeid3FxjPGr@mx.company.example.com>)
Jan 16 08:33:04 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: Milter: connect to filters
Jan 16 08:33:06 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: from=<user1@dont-cross-the-memes.example.com>, size=38065260, class=-30, nrcpts=1, msgid=<gnDSaYSEaP4Yk/.F0EhYbIYcihGO8Vd.dont-cross-the-memes.example.com>, proto=ESMTP, daemon=MTA-v6, relay=proton.dont-cross-the-memes.example.com [192.168.98.234]

Those are the three main forms of line in the log file. I need to find the received mail, meaning the lines where the email address is preceded by "from". I have written a regex like this:

^Jan\s\d\d\s(\d\d).*\bfrom\b\=<(.*)>,\s\bsize\b.*

I have tested this regex in TextWrangler. It finds all the emails and replaces each line with "hour" "email address".

However, when I try to use this regex in a sed or awk script, I have a few problems with my code.

This is the sed version:

#!/bin/bash
sed -E 's/^Jan\s\d\d\s(\d\d).*\bfrom\b\=<(.*)>,\s\bsize\b.*/\1 \2/g' output

I don't know why this code doesn't work. It doesn't replace anything. How do I fix this problem? Maybe awk is a better choice?

Answer 1

When parsing input with name=value data, I usually find it convenient to create an array that lets me access the values by name, e.g.:

$ cat tst.awk
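{
    # Rebuild the name-to-value map for each input line.
    delete n2v
    for (i=1; i<=NF; i++) {
        if ($i ~ /=/) {
            name = value = $i
            sub(/=.*/,"",name)              # keep the text before the first "="
            sub(/[^=]+=/,"",value)          # keep the text after the first "="
            gsub(/^<|[>,]+$/,"",value)      # strip a leading "<" and trailing ">" or ","
            n2v[name] = value
        }
    }

    # Debug output: dump every name/value pair found on this line.
    for (name in n2v) {
        value = n2v[name]
        print ">", name, "=", value
    }
    print "-----"
}
# Print the timestamp and sender only for lines that have a "from" field.
"from" in n2v { print $1, $2, $3, n2v["from"] }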


$ awk -f tst.awk file
> stat = Sent
> relay = mx.company.example.com.
> xdelay = 00:00:01
> to = user1@company.example.com
> dsn = 2.0.0
> mailer = esmtp
> delay = 00:00:02
> pri = 193069
-----
-----
> from = user1@dont-cross-the-memes.example.com
> relay = proton.dont-cross-the-memes.example.com
> nrcpts = 1
> class = -30
> size = 38065260
> proto = ESMTP
> msgid = gnDSaYSEaP4Yk/.F0EhYbIYcihGO8Vd.dont-cross-the-memes.example.com
> daemon = MTA-v6
-----
Jan 16 08:33:06 user1@dont-cross-the-memes.example.com

Answer 2

You can also use awk (assuming you can match on " from=<" and the fields are always in the same order):

awk -F'[ :<>,]' '/ from=</ {print $3 " " $12}' output
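
If the field order cannot be relied on, a variant that extracts the address by pattern instead of by position may be more forgiving. This is only a sketch: the temporary file stands in for the log (the answer above calls it output), the variable names t, addr, and out are made up for illustration, and it assumes the timestamp is always the third whitespace-separated field.

```shell
#!/bin/sh
# Sample input: the three log lines from the question, written to a temp file.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 16 08:33:18 mail.knurledwidgets.example.org sendmail[1618]: qhgKT0cN80gSX: to=<user1@company.example.com>, delay=00:00:02, xdelay=00:00:01, mailer=esmtp, pri=193069, relay=mx.company.example.com. [192.168.123.12], dsn=2.0.0, stat=Sent (OK <sp4jffaeid3FxjPGr@mx.company.example.com>)
Jan 16 08:33:04 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: Milter: connect to filters
Jan 16 08:33:06 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: from=<user1@dont-cross-the-memes.example.com>, size=38065260, class=-30, nrcpts=1, msgid=<gnDSaYSEaP4Yk/.F0EhYbIYcihGO8Vd.dont-cross-the-memes.example.com>, proto=ESMTP, daemon=MTA-v6, relay=proton.dont-cross-the-memes.example.com [192.168.98.234]
EOF

out=$(awk '
/ from=</ {
    split($3, t, ":")            # $3 is HH:MM:SS, so t[1] is the hour
    addr = $0
    sub(/.* from=</, "", addr)   # drop everything up to and including " from=<"
    sub(/>.*/, "", addr)         # drop the closing ">" and the rest of the line
    print t[1], addr
}' "$log")

printf '%s\n' "$out"
rm -f "$log"
```

With the sample lines this prints the hour and sender for the third line only, because it is the only one containing " from=<".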

Answer 3

I think the problem is with the \d syntax. It does not mean what you think. In sed, \d is followed by a decimal value and matches the character with that code, so it causes your regex to fail. Replace each \d with [0-9], like:

sed -r 's/^Jan\s[0-9]{2}\s([0-9]{2}).*\bfrom\b=<(.*)>,\s\bsize\b.*/\1 \2/g' output

Note that I use the -r switch, because I don't know what -E means.

For the only line that matches (the third one), it yields:

08 user1@dont-cross-the-memes.example.com
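
Without -n, sed also prints the non-matching lines unchanged. A possible variant that prints only the lines where the substitution succeeded, and avoids \s and \b in favour of POSIX character classes, is sketched below. It assumes a sed that accepts -E (GNU and BSD sed both do); the temporary file and the variable names log and result are placeholders, and [^>]* is swapped in for (.*) so the capture stops at the first ">".

```shell
#!/bin/sh
# Sample input: the three log lines from the question, written to a temp file.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 16 08:33:18 mail.knurledwidgets.example.org sendmail[1618]: qhgKT0cN80gSX: to=<user1@company.example.com>, delay=00:00:02, xdelay=00:00:01, mailer=esmtp, pri=193069, relay=mx.company.example.com. [192.168.123.12], dsn=2.0.0, stat=Sent (OK <sp4jffaeid3FxjPGr@mx.company.example.com>)
Jan 16 08:33:04 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: Milter: connect to filters
Jan 16 08:33:06 mail.knurledwidgets.example.org sendmail[3539]: q5c1SrFqkAZq9b: from=<user1@dont-cross-the-memes.example.com>, size=38065260, class=-30, nrcpts=1, msgid=<gnDSaYSEaP4Yk/.F0EhYbIYcihGO8Vd.dont-cross-the-memes.example.com>, proto=ESMTP, daemon=MTA-v6, relay=proton.dont-cross-the-memes.example.com [192.168.98.234]
EOF

# -n suppresses the default output; the trailing p flag prints a line
# only when the substitution actually matched it.
result=$(sed -nE 's/^Jan[[:space:]]+[0-9]{2}[[:space:]]+([0-9]{2}).*from=<([^>]*)>,[[:space:]]*size=.*/\1 \2/p' "$log")

printf '%s\n' "$result"
rm -f "$log"
```

The first two sample lines produce no output here, since neither contains from=< followed by a size= field.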

3 answers

Answer 1
4 2015-02-20 23:32:22

Answer 2
1 2015-02-20 23:27:54

Answer 3
0 2015-02-20 23:16:48
