How to merge multiple lines to create two records based on the field separator?

Question

I need help writing a Unix script loop to process the following data:

200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||

I have data in the above format in a text file. What I need to do is remove the newline character at the end of every line whose next line has | as its first character. The output I need is:

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02 |2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

I need some help to achieve this. These shell commands are giving me nightmares!

Answer 1

The sed approach:

sed ':a;N;$!ba;s/\n|/|/g' input.txt

That said, awk would be faster and easier to understand and maintain. I just had this example handy (a common solution for removing trailing newlines with sed).

EDIT:

To clarify the difference between this answer (option #1) and the alternative solution by @potong (which I actually prefer: sed ':a;N;s/\n|/|/;ta;P;D' file), which I'll call option #2:

Note that these are two of many possible options with sed. I actually prefer non-sed solutions, since they generally run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 all in memory, and option #2 as a stream. (Note: below, when I say "buffer", technically I mean "pattern space".)
Option #1 reads the whole file into memory:
- :a is just a label; N appends the next line to the buffer; if end-of-file ( $ ) is not ( ! ) reached, branch ( b ) back to label :a ...
- then, after the whole file has been read into memory, the substitution command ( s ) replaces every ( g ) occurrence of "\n|" (a newline followed by "|") in the buffer with just "|".
Option #2 processes just a couple of lines at a time:
- it reads/appends the next line ( N ) into the buffer and processes it ( s/\n|/|/ ); it branches ( t ) back to label :a only if the substitution was successful; otherwise it prints ( P ) and deletes ( D ) the buffer up to the first embedded newline ... and the stream continues.
Option #1 takes a lot more memory to run: in general, as much as the size of your file. Option #2 requires minimal memory; so little that I didn't bother to see what it correlates to (I'm guessing the length of a line).
Option #1 runs faster: in general, about twice as fast as option #2, but obviously it depends on the file and what is being done.
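
To make the comparison concrete, here is a minimal sketch that runs both options on a tiny sample. The sample data and the function names join_slurp and join_stream are made up for illustration. The sed scripts are the same as above, only split into separate -e expressions, which I assume is friendlier to non-GNU sed implementations.

```shell
# Hypothetical sample: two records, each split across continuation lines.
sample='A|1
|2
|3
B|4
|5'

# Option #1: slurp the whole input, then join every "\n|" in one pass.
join_slurp() {
    sed -e ':a' -e 'N' -e '$!ba' -e 's/\n|/|/g'
}

# Option #2: stream; keep joining while the next line starts with "|",
# otherwise print the finished record and carry on with the rest.
join_stream() {
    sed -e ':a' -e 'N' -e 's/\n|/|/' -e 'ta' -e 'P' -e 'D'
}

printf '%s\n' "$sample" | join_slurp
printf '%s\n' "$sample" | join_stream
```

With GNU sed, both calls should print the same two joined records (A|1|2|3 and B|4|5); only the memory and time profiles differ.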

On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):

$ du -h /tmp/foobar.txt
544M    /tmp/foobar.txt

$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real    0m1.564s
user    0m1.390s
sys 0m0.171s

$ time sed  ':a;N;s/\n|/|/;ta;P;D'  /tmp/foobar.txt  > /dev/null 
real    0m3.418s
user    0m3.239s
sys 0m0.163s

At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4197 11001 99 172427 558888 1 19:22 pts/10   00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 558188 (545M)

And option #2:

$ ps -F -C sed
UID        PID  PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
username  4401 11001 99  3468   864   3 19:22 pts/10   00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt

note: /proc/{pid}/smaps (Pss): 236 (0M)

In summary (with commentary):

- If you have files of unknown size, streaming without buffering is the better decision.
- If every second matters, then buffering the entire file and processing it at once may be fine, but YMMV.
- My personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable), or even bash, may be preferable to sed.
- Yet sed is a very flexible and powerful tool that gets a job done quickly and can be tuned later.

Answer 2

Here is an awk solution:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf "\n"$0} END{print""}' data

200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

Explanation:

Awk implicitly loops through every line in the file.

substr($0,1,1)=="|"{printf $0;next}

If this line begins with a vertical bar, print it (without a final newline) and skip to the next line. We use printf here, rather than the more common print, so that newlines are not printed unless we explicitly ask for them.

{printf "\n"$0}

If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).

END{print""}

At the end of the file, print a newline.

Refinement

The above prints an extra newline at the beginning of the output. If that is a problem, it can be eliminated with a minor change:

$ awk 'substr($0,1,1)=="|"{printf $0;next} {printf new $0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
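
One caveat worth noting: printf $0 uses the data itself as the format string, so a % character in the input would be interpreted as a format specifier. A minimal sketch of a variant that avoids this by always printing through a fixed "%s" format; the function name join_awk, the variable sep, and the sample input are made up for illustration, and the NR check in END is my addition so that empty input produces no output:

```shell
# Same logic as the refinement above, but the line is passed as an
# argument to a fixed "%s" format instead of being the format string.
join_awk() {
    awk 'substr($0,1,1)=="|" {printf "%s", $0; next}
         {printf "%s%s", sep, $0; sep="\n"}
         END {if (NR) print ""}'
}

# Hypothetical input containing "%" characters.
printf '%s\n' 'A|100%' '|50%d' 'B|x' | join_awk
```

On data without % characters, such as the sample in the question, this should behave the same as the refinement.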

Answer 3

This might work for you (GNU sed):

sed ':a;N;s/\n|/|/;ta;P;D' file

This processes the file a line at a time, as an alternative to @michael_n's answer, which slurps the file content into memory before processing.

Answer 4

You could do this simply with perl:

$ perl -0777pe 's/\n(?=\|)//g' file
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||

Answer 5

awk -f test.awk input.txt

test.awk

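# Append the buffered continuation lines to the current record and print it.
function emit(    j)
{
    for (j = 0; j < i; j++)
        line = line array[j]
    i = 0
    print line
}

{
    if ($0 ~ /^\|/)
    {
        # Continuation line: buffer it until the record is complete.
        array[i++] = $0
    }
    else
    {
        # A new record starts: print the previous one, if there is one.
        if (NR > 1)
            emit()
        line = $0
    }
}

# Print the final record, which the main block never reaches.
END { if (NR > 0) emit() }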

Answer 6

awk -f inp.awk input | sed '/^$/d'

inp.awk

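{
    if ($0 !~ /^\|/)
    {
        # A new record starts: print the previous one. On the first line
        # this prints an empty line, which the sed filter removes.
        print line
        line = $0
    }
    else
    {
        # Continuation line: append it to the current record.
        line = line $0
    }
}

# Print the final record, which the main block never reaches.
END { print line }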

Answers (6)

Answer 1: 4, 2014-09-12 05:15:57
Answer 2: 3, accepted, 2014-09-12 05:29:27
Answer 3: 3, 2014-09-12 06:28:51
Answer 4: 2, 2014-09-12 06:00:58
Answer 5: 1, 2014-09-12 05:43:42
Answer 6: 0, 2014-09-12 11:21:06