
Removing newline characters from a pipe-delimited file for lines not starting with a timestamp

Here is an example of the data:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

I need a script to remove the newline character from lines that do not begin with a timestamp. In the example above, lines 2-6 would be appended to the last field of the first line as a sort of text blob. I know how to detect the good lines,

grep '^[0-9][0-9][0-9][0-9].*' testfile

and also the bad lines,

grep '^[^0-9][^0-9][^0-9][^0-9].*' testfile

The question now is: how do I apply this (using sed?) to put the lines following a 'good' line back into the last field of that line? Any help here would be much appreciated.

Here is an example of the desired output:

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210

Edit:

There is some disagreement as to which is the most appropriate tool. At the moment I am leaning towards Notepad++. This is close to the kind of thing I want to do, but it is not quite working; maybe someone out there can help me tune it to my use case:

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2}).*

(?! [0-9]{4}\-[0-9]{2}-[0-9]{2})  - searches for a line not like a timestamp
.*                                  - followed by anything else

The problem is that the .* catches the timestamp that I am attempting to negate. Any thoughts?

Edit 2: Thanks everyone for the helpful advice, it's definitely moving me in the right direction! The following regex finds the problematic \n character in Notepad++, but when I try to perform the substitution nothing happens:

Find: (.*)(\n)(?![0-9]{4}\-[0-9]{2}\-[0-9]{2})
Replace: \1

Does anyone have any ideas as to how to force Notepad++ to remove the problematic \n?

Edit 3: Here is additional sample data that does not seem to work with the proposed solutions:

2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

Using all of your posted sample input concatenated in one file:

$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

$ awk 'NR>1{pre = (/^[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ ? ORS : OFS)} {printf "%s%s",pre,$0} END{print ""}' file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847

If that's not your expected output, please update your question to show what it is.
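
The one-liner above can also be spelled out as a small shell function. This is only a sketch of the same idea, not the answerer's exact code: the function name join_records is made up for illustration, and the interval expressions are written out as repeated [0-9] so that awk implementations without {n} support also accept the pattern.

```shell
# join_records (hypothetical name): fold continuation lines into the
# preceding timestamped record. Reads the named files, or stdin if none.
join_records() {
    awk '
        # A record starts with a date such as 2013-06-22.
        /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]/ {
            if (NR > 1) printf "\n"   # close the previous record
            printf "%s", $0
            next
        }
        # Anything else is a continuation: join it with a single space.
        { printf "%s%s", (NR > 1 ? " " : ""), $0 }
        # Terminate the last record, but print nothing for empty input.
        END { if (NR > 0) printf "\n" }
    ' "$@"
}
```

Called as join_records file, it should produce the same joined records as the one-liner, with the padding inside fields left untouched.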

Simplest solution:

echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

This works because the unquoted command substitution is split into words, so echo prints everything on a single line; we then insert @@@ before each timestamp and replace each @@@ with a newline character.
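
A side effect of the unquoted $(cat file) is that runs of whitespace are collapsed, which is why the padded second field loses its padding in the output below. Here is a minimal sketch of that behaviour in isolation; the helper name flatten is made up for illustration. Unquoted expansion also expands glob characters such as *, so this approach is only safe when the data contains none.

```shell
# flatten (hypothetical name): mimic `echo $(cat file)` on stdin.
flatten() {
    # Deliberately unquoted: the shell splits the text on spaces, tabs and
    # newlines, and echo joins the resulting words with single spaces.
    echo $(cat)
}

printf 'a|b        |c\nmore text\n' | flatten
# prints: a|b |c more text
```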

tiago@dell:~$ echo $(cat file) | sed -re 's/(2013-06)/@@@\1/g' | sed -re 's/@@@/\n/g'

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0 
2013-06-22 00:00:49.307121|0950704421406 |PHONE HOME|SDRKRKS|REAS|something|MRS 
2013-06-22 00:00:50.379487|0441813679603 |PHONE HOME|SDRKRKS|REAS|something|TN 90210 
2013-06-22 00:00:02.540298|0238704723874 |SMELL TEST|HAKEKJ |REAS|No cooking|tcna / ncc 
2013-06-22 00:00:04.302887|3289749873342 |SMELL TEST|ICNIDF |REAS|No cooking|JINUJ/CVGIND/NASR 6:13 AM 6/22/2013 VERIFIED CURLING TN :- 834974978398 XX and YY updated THIS IS A SENTENCE 
2013-06-22 00:00:06.937545|30874987392838 |SMELL TEST|KCIDKD |REAS|No cooking|SrutiD/cvgind/nasr tn 4887839847
tiago@dell:~$ cat file
2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013
NUM: 90834098
data: 0394884
cX: 90h010f03040f
mR: 034050t0ds0
cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210
2013-06-22 00:00:02.540298|0238704723874        |SMELL TEST|HAKEKJ  |REAS|No cooking|tcna / ncc
2013-06-22 00:00:04.302887|3289749873342        |SMELL TEST|ICNIDF  |REAS|No cooking|JINUJ/CVGIND/NASR
6:13 AM 6/22/2013
VERIFIED CURLING
TN :- 834974978398
XX and YY updated
THIS IS A SENTENCE
2013-06-22 00:00:06.937545|30874987392838        |SMELL TEST|KCIDKD  |REAS|No cooking|SrutiD/cvgind/nasr
tn 4887839847

I am not sure what you would like to do, since you have not provided an example of the output.
But if you want to join the lines, you can try this awk (the pattern is anchored to the start of the line so that a date such as 6/22/2013 inside a continuation line does not start a new record):

awk '{printf (!/^2013-/?" ":RS)"%s",$0} END {print ""}' file

2013-06-22 00:00:49.307121|147374 |PHONE HOME|SDRKRKS|REAS|something|KRISTCOS 11:13 AM 6/22/2013 NUM: 90834098 data: 0394884 cX: 90h010f03040f mR: 034050t0ds0 cNUM: 034050t0ds0
2013-06-22 00:00:49.307121|0950704421406        |PHONE HOME|SDRKRKS|REAS|something|MRS
2013-06-22 00:00:50.379487|0441813679603        |PHONE HOME|SDRKRKS|REAS|something|TN 90210

Here is one way using GNU sed:

sed -nr ':a;N;/\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//p};s/\n/ /g;$!ba;p' file

Explanation:

  • :a creates a label named a.
  • N appends the next input line to the pattern space.
  • /\n[0-9]{4}-[0-9]{2}-[0-9]{2}/{P;$!D;s/.*\n//p} tests whether the line just appended starts with a date. If so, P prints up to the first newline and, unless this is the last line, D deletes up to the first newline and restarts the cycle. On the last line, everything up to the newline is deleted and the remainder is printed.
  • s/\n/ /g; replaces the newlines of all other (continuation) lines with spaces.
  • $!ba branches back to the label and repeats unless this is the last line; the final p prints what is left.
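
The same buffering idea can be sketched as a plain shell loop, which may be easier to follow than the sed commands: hold the current record, append continuation lines with a space, and flush when the next timestamp line arrives. The function name join_buffered is made up for illustration, and a loop like this will be much slower than sed or awk on large files.

```shell
# join_buffered (hypothetical name): read stdin, print joined records.
join_buffered() {
    buf=''
    have_buf=0
    # The "|| [ -n ... ]" part also handles a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]*)
                # A new record starts: flush the one held so far.
                if [ "$have_buf" -eq 1 ]; then
                    printf '%s\n' "$buf"
                fi
                buf=$line
                ;;
            *)
                # Continuation: append to the held record with a space.
                if [ "$have_buf" -eq 1 ]; then
                    buf="$buf $line"
                else
                    buf=$line
                fi
                ;;
        esac
        have_buf=1
    done
    # Flush the final record once the input is exhausted.
    if [ "$have_buf" -eq 1 ]; then
        printf '%s\n' "$buf"
    fi
}
```

It reads standard input, so it would be used as join_buffered < file.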

This might work for you (GNU sed):

sed ':a;$!N;/^[^|]*$/Ms/\n/ /;ta' file

If the last line appended does not contain a |, replace the newline with a space and repeat.
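
The same "no pipe means continuation" heuristic can be sketched in awk, which avoids juggling the pattern space. The function name join_by_pipe is made up for illustration, and the sketch assumes that every real record contains a | and that continuation text never does.

```shell
# join_by_pipe (hypothetical name): treat any line without "|" as a
# continuation of the previous line. Reads the named files, or stdin.
join_by_pipe() {
    awk '
        {
            # index() returns 0 when the line has no "|" at all.
            is_record = (index($0, "|") > 0)
            if (NR > 1) printf "%s", (is_record ? "\n" : " ")
            printf "%s", $0
        }
        END { if (NR > 0) printf "\n" }
    ' "$@"
}
```

Because each line is classified on its own, a continuation line at the very end of the file is still joined to the record before it.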
