How to keep the first and last line and delete the rest in a log file

Question

I'm running several Spark jobs, which produce log output like the following while a job is waiting for resources.

22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

After the job completes, I want to reduce the log file size by removing the redundant messages with a Perl command. I want output like the below, with only the first and last line.

22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

I tried something like the below, but since the timestamp changes on every line, I'm not able to use the greedy operator.

perl -0777 -ne ' { s/(^\d\d\/\d\d\/\d\d \d\d:.+? Could not get any http protocol, using HTTP and will try to get protocol again.)+/$1/mg;print } ' log-file.

In the actual scenario, the log section could repeat multiple times, something like this:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6

Can this be solved using regex?

Required output:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6

Answer 1

You can try this regex:

^(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+

Explanation:

^ - matches the start of a line (this requires the multiline flag, m)
(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*) - the first log line, stored as group 1
- \d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2} - matches a timestamp in the format XX/XX/XX XX:XX:XX, where X is a digit
- (.*$) - matches everything up to the end of the line. Whatever is matched is stored in group 2; this is the actual log message (without the timestamp).
- \s* - matches 0 or more whitespace characters
(\d{2}(?:\/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+ - matches all the remaining consecutive log lines that start with a timestamp in the format XX/XX/XX XX:XX:XX followed by the contents of group 2; only the last such log line is stored in group 3

Now replace each match with the contents of group 1 followed by group 3: $1$3
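
As a rough sketch of how this regex and the $1$3 replacement could be applied from Perl: the helper name collapse_with_answer1_regex and the shortened "waiting" message in the sample are assumptions made for illustration, not part of the answer. The m flag is what lets ^ and $ match at line boundaries, and g repeats the replacement for every run.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each run of log lines that differ only in
# their timestamp down to the first and last line of the run.
sub collapse_with_answer1_regex {
    my ($text) = @_;
    # Group 1 = first line, group 2 = message without timestamp,
    # group 3 = last repetition of the following lines with the same message.
    $text =~ s{^(\d{2}(?:/\d{2}){2} \d{2}(?::\d{2}){2}(.*$)\s*)(\d{2}(?:/\d{2}){2} \d{2}(?::\d{2}){2}\2\s*)+}{$1$3}mg;
    return $text;
}

# Shortened, made-up sample in the same shape as the log in the question
my $sample = join '',
    "sometext1\n",
    "22/01/03 14:42:25 INFO rpc.b: waiting\n",
    "22/01/03 14:42:27 INFO rpc.b: waiting\n",
    "22/01/03 14:42:29 INFO rpc.b: waiting\n",
    "sometext2\n";

print collapse_with_answer1_regex($sample);
```

For a whole file, the same substitution would need the file contents read into a single string first, as the -0777 switch in the question's attempt does.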

Answer 2

While a regex solution may be possible, this can easily be solved with normal Perl code. I think the code is clearer and easier to maintain. I added 3 lines to your sample input to test the edge case where the input ends on a line that matches the search string.

use strict;
use warnings;

# This string can be replaced as needed
my $str = "INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again";

my ($first, $last);

while (<DATA>) {
    if (/\Q$str/) {                      # if pattern matches current line
        if ($first) {                    # if this is an "in between" line
            $last = $_;                  # save line and go next
        } else {                         # if this is the first line
            print if not eof;            # print it..
            $first = $_;                 # ...save line and go next
        }
        print if eof;                    # print last line to avoid edge cases
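    } elsif ($first) {                   # $str didn't match: finished a range of lines
        print $last if defined $last;    # print last line of the range (absent for a one-line range)
        print;                           # print the current line
        $first = undef;                  # reset for the next range
        $last = undef;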
    } else {
        print;                           # print everything else
    }
}

__DATA__
sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:27 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:29 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:31 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:33 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:35 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:37 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

Output:

sometext1
sometext2
22/01/03 14:42:25 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:39 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext3
sometext4
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:53 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
sometext5
sometext6
22/01/03 14:42:49 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again
22/01/03 14:42:51 INFO rpc.b: Could not get any http protocol, using HTTP and will try to get protocol again

Answer 3

A regex solution, for diversity

use warnings;
use strict;
use feature 'say';

die "Usage: $0 file\n" if not @ARGV;

my $fc = do { local $/; <> };  # file contents

my $ts = qr{[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}};

# /x ignores whitespace in the pattern only, so the replacement must not contain spaces
$fc =~ s{ ($ts\s*(.*?)\n) (?: $ts\s*\g{-1}\n )+ ( $ts\s*\g{-2}\n ) }{$1$3}gsx;

print $fc;

It matches a timestamp, then the rest of the line; then a timestamp plus what was last captured (via \g{-1}, a relative backreference), as many times as possible, up to another such pattern (since we need both the first and the last). So it stops at any other text.

This can be squeezed into a one-liner, if for some reason that's required.
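
To see where the match stops, here is a small sketch that wraps the same substitution in a sub so it can be tried on strings. The sub name keep_first_and_last and the one-letter messages A and B are assumptions made for illustration only.

```perl
use strict;
use warnings;

my $ts = qr{[0-9]{2}/[0-9]{2}/[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}};

# Hypothetical wrapper around the substitution from this answer.
sub keep_first_and_last {
    my ($text) = @_;
    # Group 1: first line of a run; group 2: its message without the timestamp.
    # (?: ... )+ consumes the middle lines; group 3 is the last line of the run.
    $text =~ s{ ($ts\s*(.*?)\n) (?: $ts\s*\g{-1}\n )+ ( $ts\s*\g{-2}\n ) }{$1$3}gsx;
    return $text;
}

# Made-up input: a run of three "A" lines followed by a run of two "B" lines
my $input = join '',
    "22/01/03 14:42:25 A\n",
    "22/01/03 14:42:27 A\n",
    "22/01/03 14:42:29 A\n",
    "22/01/03 14:42:31 B\n",
    "22/01/03 14:42:33 B\n",
    "end\n";

print keep_first_and_last($input);
```

The three "A" lines are reduced to the first and last. The two "B" lines are left as they are: the pattern needs at least one middle line, and a two-line run already consists of only its first and last line.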
