How to retain first and last line and remove in between in logfile

I'm running several Spark jobs, which produce log output like the following while a job is waiting for resources.

22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

After the job completes, I want to reduce the log file size by removing the redundant messages with a Perl command. I want output like the following, with only the first and last lines.

22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

I tried something like the following, but since the timestamp changes from line to line, I'm not able to use the greedy operator.

perl -0777 -ne ' { s/(^\d\d\/\d\d\/\d\d \d\d:.+? Could not get any http protocol, using HTTP and will try to get protocol again.)+/$1/mg;print } ' log-file.

In the actual scenario, the log section can repeat multiple times, something like this:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6

Can this be solved using a regex?

Required output:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6

You can try this regex:

^(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+

Explanation:

  • ^ - matches the start of a line (the multiline flag must be enabled so that ^ and $ match at line boundaries)
  • (\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*) - the first log line is stored as group 1
    • \d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2} - matches a timestamp of the format XX/XX/XX XX:XX:XX, where X is a digit
    • (.*$) - matches everything until the end of the line. Whatever is matched is stored in group 2, so this group holds the actual log message (without the timestamp).
    • \s* - matches 0 or more whitespace characters, including the line break
  • (\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+ - matches all the remaining consecutive log lines that start with the format XX/XX/XX XX:XX:XX followed by the contents of group 2. Because the group is repeated, only the last such log line is stored in group 3.

Now, replace each match with the contents of group 1 followed by group 3: $1$3
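
As a minimal sketch of how that substitution might be applied in Perl: the helper name collapse_runs and the shortened sample messages are made up for illustration and are not part of the answer. The regex is the one given above, used with /m so that ^ and $ match at line boundaries and /g so that every run is handled.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of consecutive log lines that
# differ only in their timestamp down to its first and last line.
sub collapse_runs {
    my ($text) = @_;
    # Group 1: first line of the run, group 2: the message after the timestamp,
    # group 3: the last repetition of "timestamp + same message".
    $text =~ s/^(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+/$1$3/mg;
    return $text;
}

# Shortened sample input (assumed for this sketch).
my $sample = <<'END';
sometext1
22/01/03 14:42:25 INFO rpc.b: waiting
22/01/03 14:42:27 INFO rpc.b: waiting
22/01/03 14:42:29 INFO rpc.b: waiting
sometext2
END

print collapse_runs($sample);
```

Note that because \s* also matches empty lines, blank lines between repeated entries would be swallowed as part of the run.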

While using a regex may be possible, this can easily be solved with normal Perl code. I think the code is clearer and easier to maintain. I added 3 lines to your sample input to test the edge case where the input ends on a line that matches our search.

use strict;
use warnings;

# This string can be replaced as needed
my $str = "INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again";

my ($first, $last);

while (<DATA>) {
    if (/\Q$str\E/) {                    # if pattern matches current line
        if (defined $first) {            # if this is an "in between" line
            $last = $_;                  # save line and go next
        } else {                         # if this is the first line
            print if not eof;            # print it..
            $first = $_;                 # ...save line and go next
        }
        print if eof;                    # print last line to avoid edge cases
    } else {                             # $str didn't match
        print $last if defined $last;    # finished a range: print its last line
        $first = undef;                  # reset, even after a one-line range
        $last = undef;
        print;                           # print everything else
    }
}

__DATA__
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

Output:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

A regex solution, for diversity:

use warnings;
use strict;
use feature 'say';

die "Usage: $0 file\n" if not @ARGV;

my $fc = do { local $/; <> };  # file contents

my $ts = qr{[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}};

# /x only affects the pattern, so the replacement must not be padded with spaces
$fc =~ s{ ($ts\s*(.*?)\n) (?: $ts\s*\g{-1}\n )+ ( $ts\s*\g{-2}\n ) }{$1$3}gsx;

print $fc;

It matches a timestamp, then the rest of the line; then a timestamp plus what was last captured (via \g{-1}, a relative backreference), as many times as possible, up to another such pattern (since we need the first and the last). So it then stops at any other text.
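
To see the relative backreferences at work on an in-memory string, here is a small sketch. The helper name squeeze_runs and the shortened sample are assumptions made for illustration; the pattern is the one described above, with the replacement written without surrounding spaces. A run needs at least three lines to match (first, one or more in between, last), so shorter runs are left as they are.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the first and last line of each run of
# three or more lines that differ only in their timestamp.
sub squeeze_runs {
    my ($text) = @_;
    my $ts = qr{[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}};
    # \g{-1} and \g{-2} both refer to the captured message (group 2):
    # the offset is counted from where the backreference appears.
    $text =~ s{ ($ts\s*(.*?)\n) (?: $ts\s*\g{-1}\n )+ ( $ts\s*\g{-2}\n ) }{$1$3}gsx;
    return $text;
}

# Shortened sample input (assumed for this sketch).
my $sample = <<'END';
sometext1
22/01/03 14:42:25 INFO rpc.b: waiting
22/01/03 14:42:27 INFO rpc.b: waiting
22/01/03 14:42:29 INFO rpc.b: waiting
sometext2
END

print squeeze_runs($sample);
```

Since every line in the pattern ends in \n, a final log line without a trailing newline would not be treated as part of a run.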

This can be squeezed into a one-liner, if for some reason that's required.
