Process a delimited text file with sed

I have a ";"-delimited file:

aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;           
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf

I'd like to process it, replacing each missing value with \N . The result should be:

aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N         
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

I'm trying to do it with a sed script:

sed "s/;\(;\)/;\\N\1/g" file1.txt  >file2.txt

But what I get is:

aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;         
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can write ; directly in the replacement string:

sed 's/;;/;\\N;/g'

As you noticed, when sed finds a pair of semicolons it replaces the pair with the desired string and then continues after it, without reading the second semicolon again. As a result, in a run of semicolons it inserts \N only into every other empty field.
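
The skipping is easy to see on a single run of four semicolons. The sketch below is only an illustration (the function name fill_once is made up for this example); it applies the substitution once and shows that the matches do not overlap:

```shell
# One pass of the substitution; matches of ;; never overlap.
fill_once() {
  sed 's/;;/;\\N;/g'
}

# The run ;;;; is consumed as two separate pairs, (;;)(;;),
# so the empty field between the two pairs is never seen.
printf 'a;;;;b\n' | fill_once
# prints: a;\N;;\N;b
```

This is the same pattern as in the output shown in the question.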

A solution is to use a positive lookahead; the regex is /;(?=;)/ but sed doesn't support lookaheads.

But it's possible to solve the problem with sed in a simple manner: run the substitution twice. The first command replaces the odd occurrences of ;; with ;\N; and the second one takes care of the even occurrences. The final result is the one you need.

The command is as simple as:

sed 's/;;/;\\N;/g;s/;;/;\\N;/g'

It duplicates the previous command and uses a ; between g and s to separate the two. Alternatively, you can use the -e command line option once for each search expression:

sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
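
Two passes are enough however long the run of semicolons is: any ;; left over after the first pass is an isolated pair, never part of a longer run, so the second pass can fix all of them. A small sketch that prints the intermediate result (fill_twice is a made-up name):

```shell
fill_twice() {
  sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
}

# Five semicolons mean four empty fields between a and b.
printf 'a;;;;;b\n' | sed 's/;;/;\\N;/g'
# after pass 1: a;\N;;\N;;b
printf 'a;;;;;b\n' | fill_twice
# after pass 2: a;\N;\N;\N;\N;b
```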

Update:

The OP asks in a comment: "What if my file have 100 columns?"

Let's try and see if it works:

$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;

Look, ma! It works! :-)

Update #2

I ignored the fact that the question doesn't ask to replace ;; with something else, but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix a missing value when it occurs at the beginning or at the end of the line.

As the OP kindly added in a comment, the complete sed command is:

sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'

or (for readability):

sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'

The two additional steps handle a ; found at the beginning or at the end of the line: \N is inserted before a leading ; and after a trailing ; .
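
A quick check of the complete command on a line whose first and last fields are both empty. This is a sketch; fill_all is a name invented here, and the g flag is dropped from the two anchored substitutions because an anchored pattern can match at most once per line:

```shell
fill_all() {
  sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/;s/;$/;\\N/'
}

printf ';a;;b;\n' | fill_all
# prints: \N;a;\N;b;\N
```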

You can use this sed command with 2 s (substitute) commands:

sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

Or use a lookaround regex in a perl command:

perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf

The main problem is that the same character can't take part in more than one match within a single substitution:

s/;;/..../g : The second ; can't be reused for the next match in a string like ;;;

If you want to do it with sed, without a Perl-like regex mode, you can use a loop with the conditional command t :

sed ':a;s/;;/;\\N;/g;ta;' file

:a defines a label "a"; ta jumps to this label only if something has been replaced.
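
The one-line form, with the label and the branch separated by semicolons, works in GNU sed. A more portable way to write the same loop, as far as I know, is to give each command its own -e, since POSIX sed expects a label to end at a newline. A sketch (fill_loop is a name invented for the example):

```shell
fill_loop() {
  # :a marks the top of the loop; ta jumps back to it only if
  # the s command replaced something on this line.
  sed -e ':a' \
      -e 's/;;/;\\N;/g' \
      -e 'ta'
}

# Six semicolons mean five empty fields between a and b.
printf 'a;;;;;;b\n' | fill_loop
# prints: a;\N;\N;\N;\N;\N;b
```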

For the ; at the end of the line (and to deal with possible trailing whitespace):

sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
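
The same idea can be tried on the line from the question that ends in blanks. This sketch assumes the POSIX class [[:space:]] is an acceptable stand-in for the [ \t\r] bracket, and fill_trailing is a made-up name:

```shell
fill_trailing() {
  # Loop until no ;; is left, then treat a final ; followed only
  # by whitespace as an empty last field.
  sed -e ':a' -e 's/;;/;\\N;/g' -e 'ta' \
      -e 's/;[[:space:]]*$/;\\N/'
}

printf 'sfaf;sdfas;;;   \n' | fill_trailing
# prints: sfaf;sdfas;\N;\N;\N
```

Note that the trailing blanks are removed along the way, because they are part of the matched text.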

This awk one-liner will give you what you want:

awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
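
For readers less used to awk, here is the same one-liner spread over several lines with comments. It is only a sketch, and fill_awk is a name made up for the example:

```shell
fill_awk() {
  awk -F';' -v OFS=';' '
    {
      # Visit every field; assigning to $i makes awk rebuild $0 with OFS.
      for (i = 1; i <= NF; i++)
        if ($i == "")
          $i = "\\N"
    }
    # A constant non-zero pattern is always true; with no action,
    # awk prints the record. This is what the trailing 7 does.
    7
  '
}

printf ';a;;b;\n' | fill_awk
# prints: \N;a;\N;b;\N
```

A field that contains only blanks is not equal to the empty string, so it is left untouched by this version.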

If you really want the line sfaf;sdfas;\N;\N;\N , this line works for you:

awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file

sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile

  • non-recursive, one-liner, POSIX compliant

Concept:

  • change every ; to ;\N
  • put the plain ; back wherever a value follows
  • handle the special case of the last ; , possibly followed by blanks before the end of the line
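
The three steps can be followed one at a time. The sketch below splits the command into three small functions (step1, step2 and step3 are names invented here) and prints the line between brackets so that the trailing space stays visible:

```shell
step1() { sed 's/;/;\\N/g'; }               # mark every delimiter
step2() { sed 's/;\\N\([^;]\)/;\1/g'; }     # unmark where a character follows
step3() { sed 's/;[[:blank:]]*$/;\\N/'; }   # last field empty or blank

line='a;;b; '   # note the trailing space

printf '[%s]\n' "$(printf '%s\n' "$line" | step1)"
# prints: [a;\N;\Nb;\N ]
printf '[%s]\n' "$(printf '%s\n' "$line" | step1 | step2)"
# prints: [a;\N;b; ]
printf '[%s]\n' "$(printf '%s\n' "$line" | step1 | step2 | step3)"
# prints: [a;\N;b;\N]
```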

This might work for you (GNU sed):

sed -r ':a;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;ta' file

There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, an empty field following an empty field, and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop ( :a;...;ta ). Multiple scenarios can be replaced in both passes using the g flag.
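
A quick way to check the four scenarios is to feed one line per scenario through the command. This sketch assumes GNU sed and uses a named label together with -E, which GNU sed accepts as a synonym of -r; fill_gnu is a made-up name:

```shell
fill_gnu() {
  # Only one of the three groups matches at a time; the other
  # back-references expand to nothing in the replacement.
  sed -E ':a;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;ta'
}

# start of record, between delimiters, empty after empty, end of record
printf '%s\n' ';a' 'a;;b' 'a;;;b' 'a;' | fill_gnu
# prints:
# \N;a
# a;\N;b
# a;\N;\N;b
# a;\N
```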

Notice: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.