
Merge lines which don't match a regex

I have a file which contains logs from the web; a simplified version of it is as follows:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Unix
Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
START
Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
Aix
SCO

I have tried a few regex combinations with awk and sed to identify the Accept-Language value, which begins every record line:

/^[a-z]{2}(-[A-Z]{2})?/
/\*|[A-Z]{1,8}(-[A-Z0-9]{1,8})*/i  
/([^-;]*)(?:-([^;]*))?(?:;q=([0-9]\.[0-9]))?/

So far I haven't managed to get either awk or sed to give me the following result:

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Unix    Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    START    Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;    Aix    SCO

Any help is appreciated. The file contains over a million records, so I'm happy to go down a route that doesn't use sed/awk and improves performance.

Based on the observation that the two types of lines can be distinguished by the presence of an = , you can use this awk script:

file.awk

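$0 ~ /=/ { printf("%s%s", v, $0)   # record line: end the previous output line, then print
           v = "\n"                # from the second record on, v holds the line break
           next                    # skip the catch-all action below
         }
         { printf("\t%s", $0) }    # any other line: append it after a tab
END      { printf("\n") }          # terminate the last output line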

You use it like this: awk -f file.awk yourfile

  • v is empty for the first line; after that it contains the line break
  • for lines with an = , we print $0 preceded by v
  • for the other lines (note the next in the first action), we print $0 without a newline but with a \t as separator
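
To see the three rules above at work, here is a minimal sketch that wraps the same program in a shell function and feeds it part of the sample from the question. The function name merge_on_equals and the variable name sep (standing in for v) are only illustrative. It assumes, as the script does, that every record line contains an = and no other line does.

```shell
# Hypothetical wrapper around the awk program above; reads stdin, writes stdout.
merge_on_equals() {
  awk '
    /=/ { printf("%s%s", sep, $0); sep = "\n"; next }  # record line: start a new output line
        { printf("\t%s", $0) }                          # other line: append after a tab
    END { printf("\n") }                                # terminate the final output line
  '
}

# Part of the sample data from the question.
printf '%s\n' \
  'en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;' \
  'Unix' \
  'Linux' \
  'en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;' \
  'START' \
  'Solaris' | merge_on_equals
```

This should print two lines, each record followed by its tab-separated continuation lines. If a continuation line ever contained an = , it would be treated as a new record, so a stricter pattern may be safer on real logs.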

Just for fun, here's a sed solution:

sed -ne 1bgo \
   -e '/^[a-z][a-z]-[A-Z][A-Z]/ { x;p;s/.*//;x; };:go' \
   -e 'H;x;s/^\n//;s/\n/  /;x;${ x;p; }' < input

It works like this:

  • Read each line, but instead of printing it right away, save it by appending it to the hold space ( H ), then replace the newline that separates it from whatever was already there ( x;s/^\n//;s/\n/  /;x ). (If you want tabs in your output, put them here where I've put a couple of spaces.)

  • If you come across a line that matches your Accept-Language pattern, flush the hold space before you append anything to it: print it and clear it ( x;p;s/.*//;x ). Then proceed as usual with the appending and whatnot.

  • Treat the first and last lines differently from all others: never flush the hold space after reading just the first line ( 1bgo skips past that, down to the position labeled :go ), and always flush the hold space after reading the last line ( ${ x;p; } ).
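
The steps above may be easier to follow with one command per line. The sketch below is the same sequence of sed commands, wrapped in a shell function and annotated. The name merge_sed is made up for illustration, and the layout assumes a sed that accepts indented commands and # comments inside a script, as GNU sed does.

```shell
# Hypothetical wrapper; reads stdin, writes merged records to stdout.
merge_sed() {
  sed -n '
    # First line: nothing is buffered yet, so skip the flush.
    1b go
    # A new record starts: print the buffered record and empty the buffer.
    /^[a-z][a-z]-[A-Z][A-Z]/ {
      x
      p
      s/.*//
      x
    }
    :go
    # Append the current line to the hold space, then turn the
    # separating newline into two spaces.
    H
    x
    s/^\n//
    s/\n/  /
    x
    # Last line: print whatever is still buffered.
    ${
      x
      p
    }
  '
}
```

Use it as merge_sed < input . The hold space always contains exactly one pending record, so memory use stays small even on a file with a million lines.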

$ awk '/[a-z]{2}-[A-Z]{2}/ { print b; b=$0; next }  # @xx-XX empty buffer, refill
                           { b=b OFS $0 }           # otherwise append to buffer
                       END { print b }' file        # dump the buffer in the end

en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd;
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Unix Linux
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; START Solaris
en-GB,en-US;q=0.8,en    jsdjpksdkskd;lkskd; Aix SCO

The output starts with an empty line, because the buffer is still empty when the first matching line triggers print b. If you want tab-delimited output, set the output field separator: awk -v OFS="\t" ...
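
If the leading empty line is a problem, one possible variation is to print the buffer only once something has been stored in it. This is a sketch rather than the code from the answer: the function name merge_records and the variables buf and have are made up for the example, the pattern is anchored to the start of the line, and the {2} intervals are spelled out as [a-z][a-z] in case the awk at hand does not support interval expressions.

```shell
# Hypothetical wrapper; reads stdin, writes tab-joined records to stdout.
merge_records() {
  awk -v OFS='\t' '
    /^[a-z][a-z]-[A-Z][A-Z]/ {        # a new record starts here
      if (have) print buf             # flush the previous record, if there is one
      buf = $0; have = 1; next
    }
    { buf = (have ? buf OFS $0 : $0)  # continuation line: append to the buffer
      have = 1 }
    END { if (have) print buf }       # flush the last record
  '
}
```

Call it as merge_records < file . The have flag also keeps empty input from producing a stray blank line.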
