
How can I merge multiple lines to create exactly two records based on field separators?

I need help writing a Unix script loop to process the following data:

200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||

I have data in the above format in a text file. What I need to do is remove the newline character at the end of every line whose next line has | as its first character. The output I need is:

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

I need some help to achieve this. These shell commands are giving me nightmares!

The 'sed' approach:

sed ':a;N;$!ba;s/\n|/|/g' input.txt

That said, awk would be faster and easier to understand and maintain. I just had that example handy (a common solution for removing trailing newlines with sed).

EDIT:

To clarify the difference between this answer (option #1) and the alternative solution by @potong (which I actually prefer: sed ':a;N;s/\n|/|/;ta;P;D' file ), which I'll call option #2:

  • note that these are two of many possible options with sed. I actually prefer non-sed solutions since they generally run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 all in memory, and option #2 as a stream. (Note: below, when I say "buffer", technically I mean "pattern space".)
  • option #1 reads the whole file into memory:
    • :a is just a label; N says append the next line to the buffer; if end-of-file ( $ ) is not ( ! ) reached, then branch ( b ) back to label :a ...
    • then, after the whole file is read into memory, process the buffer with the substitution command ( s ), replacing all occurrences of "\n|" (newline followed by "|") with just a "|", on the entire ( g ) buffer
  • option #2 just processes a couple of lines at a time:
    • reads / appends the next line ( N ) into the buffer, processes it ( s/\n|/|/ ); branches ( t ) back to label :a only if the substitution was successful; otherwise prints ( P ) and clears/deletes ( D ) the current buffer up to the first embedded newline ... and the stream continues.
  • option #1 takes a lot more memory to run: in general, as much as the size of your file. Option #2 requires minimal memory; so small I didn't bother to see what it correlates to (I'm guessing the length of a line).
  • option #1 runs faster. In general, twice as fast as option #2; but obviously it depends on the file and what is being done.
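To see that the two options produce the same result, here is a minimal sketch that runs both on a small made-up sample (an abbreviated version of the question's data). It assumes GNU sed, which accepts the semicolon-separated one-liner form used in this answer; the variable names are only for illustration.

```shell
#!/bin/sh
# Sample input: two records, each followed by continuation lines starting with "|".
sample=$(mktemp)
printf '%s\n' \
  '200250|Wk50|2002' \
  '|2003-01-12' \
  '|2003-02-01|' \
  '|2003-02-01|||' \
  '200239|Wk39|2002' \
  '|2002-10-27' \
  '|2003-02-01|||' > "$sample"

# Option #1: read the whole file into the pattern space, then substitute once.
slurped=$(sed ':a;N;$!ba;s/\n|/|/g' "$sample")

# Option #2: join continuation lines as they stream through.
streamed=$(sed ':a;N;s/\n|/|/;ta;P;D' "$sample")

# Show the joined records.
printf '%s\n' "$slurped"

# Report whether the two approaches agree on this input.
if [ "$slurped" = "$streamed" ]; then
  echo "both options agree"
else
  echo "outputs differ"
fi

rm -f "$sample"
```

On this sample the seven input lines collapse into two records; the difference between the options is only in how much of the file sed holds at once.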

On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):

$ du -h /tmp/foobar.txt
544M    /tmp/foobar.txt

$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real    0m1.564s
user    0m1.390s
sys 0m0.171s

$ time sed  ':a;N;s/\n|/|/;ta;P;D'  /tmp/foobar.txt  > /dev/null 
real    0m3.418s
user    0m3.239s
sys 0m0.163s

At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB. Option #1:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4197 11001 99 172427 558888 1 19:22 pts/10   00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 558188 (545M)

And option #2:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4401 11001 99  3468   864   3 19:22 pts/10   00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 236 (0M)

In summary (with commentary):

  • if you have files of unknown size, streaming without buffering is the safer choice.
  • if every second matters, then buffering the entire file and processing it at once may be fine, but ymmv.
  • my personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable) or even bash may be preferable to using sed.
  • yet, sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
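For completeness, the plain-shell route mentioned in the list above could look roughly like the following. This is only a sketch, not something taken from the answers here: join_records is a made-up function name, the sample data is invented, and no claim is made about its speed relative to sed or awk.

```shell
#!/bin/sh
# join_records (hypothetical helper): read stdin and glue every line that
# starts with "|" onto the end of the previous output line.
join_records() {
  started=0
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '|'*) ;;                                   # continuation: stay on the current output line
      *) [ "$started" -eq 0 ] || printf '\n' ;;  # new record: finish the previous line first
    esac
    printf '%s' "$line"
    started=1
  done
  # Terminate the last record, unless the input was empty.
  [ "$started" -eq 0 ] || printf '\n'
}

# Usage on a small invented sample:
printf '%s\n' 'A|1' '|2' '|3|' 'B|1' '|2' | join_records
```

With a real file the call would be join_records < input.txt.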

Here is an awk solution:

$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "\n%s",$0} END{print""}' data

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

Explanation:

Awk implicitly loops through every line in the file.

  • substr($0,1,1)=="|"{printf "%s",$0;next}

    If this line begins with a vertical bar, then print it (without a final newline) and then skip to the next line. We are using printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them. The explicit "%s" format keeps a literal % in the data from being read as a format directive.

  • {printf "\n%s",$0}

    If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).

  • END{print""}

    At the end of the file, print a newline.

Refinement

The above prints out an extra newline at the beginning of the output. If that is a problem, then it can be eliminated with just a minor change:

$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "%s%s",new,$0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
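As a quick check of the refined version, the sketch below feeds it a small invented sample whose first record contains a literal % sign. It passes an explicit "%s" format to printf so the data is never treated as a format string; the variable names (joined, sep) are illustrative only.

```shell
#!/bin/sh
# Hypothetical sample: the first record contains a literal "%".
joined=$(printf '%s\n' 'rec1|100%' '|a' '|b|' 'rec2|x' '|c' |
  awk 'substr($0,1,1)=="|" {printf "%s", $0; next}    # continuation: append to current line
       {printf "%s%s", sep, $0; sep="\n"}             # new record: newline before all but the first
       END {print ""}')                               # terminate the last record

printf '%s\n' "$joined"
```

Because sep is empty until the first record has been printed, the output starts directly with the first record rather than a blank line.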

This might work for you (GNU sed):

sed ':a;N;s/\n|/|/;ta;P;D' file

This processes the file a line at a time, as an alternative to @michael_n's answer, which slurps the file content into memory before processing.

You could do this simply through perl:

$ perl -0777pe 's/\n(?=\|)//g' file
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
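
An awk script file is another option:

awk -f test.awk input.txt

test.awk

# Append the buffered continuation lines to the current record and print it.
function emit(    j) {
    for(j=0;j<i;j++)
    {
            line = line array[j];
    }
    i=0;
    if(line != "") print line    # skip the empty "record" before the first line
}
{
    if($0 ~ /^\|/)
    {
            array[i++] = $0      # continuation line: buffer it
    }
    else
    {
            emit()               # new record: print the previous one
            line = $0;
    }
}
END { emit() }                   # print the last record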
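
A shorter variant accumulates each record in a single variable; the sed filter drops the empty line printed before the first record:

awk -f inp.awk input | sed '/^$/d'

inp.awk

{
    if($0 !~ /^\|/)
      {
        print line;        # new record: print the previous one (empty the first time)
        line = $0;
      }
    else
      {
        line = line $0;    # continuation line: append to the current record
      }
}
END { print line }         # print the last record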
