Merge lines which don't match a regex
I have a file which contains logs from the web; a simplified version of it is as follows:
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Aix
SCO
I have tried a couple of regex combinations with awk/sed to identify the Accept-Language value, which is at the beginning of every record:
/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/
So far I haven't managed to get either awk or sed to give me the following result:
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Aix SCO
Any help is appreciated. The file contains about 1 million+ records, so I'm happy to go down a route that doesn't use sed/awk and improves performance.
Based on the observation that we can distinguish the two types of lines by the presence of =, you can use this awk script:
file.awk
$0 ~ /=/ {
    printf("%s%s", v, $0)
    v = "\n"
    next
}
{ printf("\t%s", $0) }
END { printf("\n") }
You use it like this:
awk -f file.awk yourfile
v is empty for the first line; afterwards it contains the line break.
If the line contains =, we print $0 preceded by v.
Otherwise (we only get here because of the next in the first action), we print $0 without a newline but with a \t as separator.
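To make the behaviour of the script above concrete, here is a minimal sketch that runs the same logic on the sample from the question. The variable names sample, merged, sep and nl are illustrative assumptions, not part of the original answer; sep is set to a single space so the result matches the output requested in the question, whereas the original script uses a tab.

```shell
# Sample records from the question, kept in a shell variable for illustration.
sample='en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Aix
SCO'

# sep is a hypothetical variable: use sep="\t" to reproduce the tab of the original script.
merged=$(printf '%s\n' "$sample" | awk -v sep=' ' '
    /=/ { printf "%s%s", nl, $0; nl = "\n"; next }   # record start: end the previous line first
        { printf "%s%s", sep, $0 }                   # continuation: stay on the current line
    END { if (NR) printf "\n" }                      # terminate the last line
')

printf '%s\n' "$merged"
```

The three "Unix/Linux", "START/Solaris" and "Aix/SCO" pairs end up appended to the record that precedes them, and records without continuation lines are passed through unchanged.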
Just for fun, here's a sed solution:
sed -ne 1bgo \
-e '/^[a-z][a-z]-[A-Z][A-Z]/ { x;p;s/.*//;x; };:go' \
-e 'H;x;s/^\n//;s/\n/ /;x;${ x;p; }' < input
It works like this:
Read each line but instead of printing it right away, save it by appending it to the hold space (H), except remove any newlines that separate it from whatever was already there (x;s/^\n//;s/\n/ /;x). (If you want tabs in your output, put them here where I've put a couple of spaces.)
If you come across a line that matches your Accept-Language pattern, flush the hold space before you append anything to it. Print it and clear it (x;p;s/.*//;x). Then proceed as usual with the appending and whatnot.
Treat the first and last lines differently from all others: never flush the hold space after reading just the first line (1bgo skips past that, down to the position labeled :go), and always flush the hold space after reading the last line (${ x;p; }).
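The steps described above may be easier to follow with the same commands laid out one per line. This is only a sketch of that layout, with comments added; the variable names sample and flushed are assumptions for illustration, and a single space is used as the separator.

```shell
# Sample records from the question, kept in a shell variable for illustration.
sample='en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Aix
SCO'

flushed=$(printf '%s\n' "$sample" | sed -n '
# First line: nothing to flush yet, so jump straight to the append step.
1b go
# A new record starts: swap in the hold space, print it, empty it, swap back.
/^[a-z][a-z]-[A-Z][A-Z]/ {
    x
    p
    s/.*//
    x
}
:go
# Append the current line to the hold space, then tidy the hold space:
# drop a leading newline and turn the joining newline into a space.
H
x
s/^\n//
s/\n/ /
x
# Last line: flush whatever is still held.
$ {
    x
    p
}
')

printf '%s\n' "$flushed"
```

Because each line is appended and tidied immediately, the hold space never contains more than one newline at a time, which is why a single non-global s/\n/ / is enough.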
$ awk '/[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next } # @xx-XX empty buffer, refill
{ b=b OFS $0 } # otherwise append to buffer
END { print b }' file # dump the buffer in the end
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd; Aix SCO
You will get an empty line to start the output with. Also, use a tab delimiter on output if so desired: awk -v OFS="\t" ...
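If the leading empty line is unwanted, one possible variation is to skip the flush on the very first input line. The sketch below assumes the same buffer approach; the names sample, merged and buf are illustrative, and the pattern is anchored and written with explicit character classes so it does not depend on interval-expression support in the awk in use.

```shell
# Sample records from the question, kept in a shell variable for illustration.
sample='en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en jsdjpksdkskd;lkskd;
Aix
SCO'

merged=$(printf '%s\n' "$sample" | awk '
    /^[a-z][a-z]-[A-Z][A-Z]/ {       # a new record starts here
        if (NR > 1) print buf        # flush the previous record, except before the first line
        buf = $0
        next
    }
    { buf = buf OFS $0 }             # otherwise append to the buffer
    END { if (NR) print buf }        # dump the buffer at the end, unless the input was empty
')

printf '%s\n' "$merged"
```

Passing -v OFS="\t" switches the separator to a tab, as in the note above.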