Create a parser script with a delimiter

I am trying to convert this input from file.txt

a,b;c^d"e}
f;g,h!;i8j-

into this output

a,b,c,d,e,f,g,h,i,j

with awk.

The best I have managed so far is

awk '$1=$1' FS="[;,^}8-]" OFS="." file.txt

  1. How can I keep " from being interpreted as a special character? " doesn't work.
  2. How can I avoid duplicate ,, in the output and delete the last , ?

One in awk (not for all awks: tested successfully in gawk, mawk, busybox awk and macOS awk version 20200816, unsuccessfully in Debian's awk version 20121220, aka original-awk. There are limitations with locales as well):

$ awk -v RS="^$" '{      # read whole file in 
    gsub(/[^a-z]+/,",")  # replace all non lowercase alphabet substrings with a comma
    sub(/,$/,"")         # remove trailing comma
}1' file                 # output

Output:

a,b,c,d,e,f,g,h,i,j
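
For awks where RS="^$" does not read the whole file at once, the same idea can be sketched portably by collecting the lines in a variable first. This is a minimal sketch, not part of the answer above; the names input, buf and result are illustrative, and the sample data is piped in with printf instead of being read from file.txt.

```shell
# Sample input from the question, held in a shell variable (illustrative name).
input='a,b;c^d"e}
f;g,h!;i8j-'

# Append every line to buf, then collapse each run of characters other than
# lowercase letters into one comma and trim a comma at either end.
result=$(printf '%s\n' "$input" | awk '
    { buf = buf $0 "\n" }
    END {
        gsub(/[^a-z]+/, ",", buf)
        sub(/^,/, "", buf)
        sub(/,$/, "", buf)
        print buf
    }')

printf '%s\n' "$result"
```

The extra sub(/^,/, ...) is there only in case the input starts with a non-letter; it is a no-op for the sample data.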

Using any POSIX awk and assuming you want any non-alphabetic character to act as a field separator:

$ awk -F '[^[:alpha:]]+' -v OFS=',' '{printf "%s", p; $1=$1; p=$0} END{sub(OFS"$","",p); print p}' file
a,b,c,d,e,f,g,h,i,j

If you really do just want to use the specific set of characters in your question as the field separators, then just change [^[:alpha:]]+ to [;,^}8"-]+
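
On the first question: inside a single-quoted shell argument, " is an ordinary character, so it can be listed in the bracket expression without any escaping. The following is a minimal sketch of the explicit-set variant under two assumptions that are not in the question: ! is added to the set because the sample input contains one, and the data is piped in from an illustrative variable named input. The - stays last in the brackets so that it is taken literally.

```shell
# Sample input from the question, held in a shell variable (illustrative name).
input='a,b;c^d"e}
f;g,h!;i8j-'

# Split on runs of the listed characters; each rebuilt line is held back in p
# and printed without a newline, and the final trailing comma is removed.
result=$(printf '%s\n' "$input" |
    awk -F '[;,^}8"!-]+' -v OFS=',' '
        { printf "%s", p; $1 = $1; p = $0 }
        END { sub(OFS "$", "", p); print p }')

printf '%s\n' "$result"
```

Like the command above, this relies on every input line ending in a separator; otherwise the last field of one line and the first field of the next would run together.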

KISS:

$ grep -o '[a-z]' file | paste -sd ',' -
a,b,c,d,e,f,g,h,i,j

This should work on most GNU/Linux systems, and even on busybox and FreeBSD (where the trailing - is mandatory).

I would use GNU AWK for this task in the following way. Let file.txt content be

a,b;c^d"e} f;g,h!;i8j-

then

awk 'BEGIN{FPAT="[a-z]";OFS=","}{$1=$1;print}' file.txt

gives the output

a,b,c,d,e,f,g,h,i,j

Explanation: I inform GNU AWK, using FPAT, that a field is a single lowercase ASCII letter, and that the output field separator (OFS) is a comma; then for each line I do $1=$1 to trigger a rebuild of the line and print it.

(tested in GNU Awk 5.0.1)

If you only want to replace non-letter characters with commas and squeeze repeated commas, tr is your friend:

tr -sc '[:alpha:]' ','
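
Note that tr reads standard input, and that it turns the final newline into a comma as well, so the result ends in a comma and has no trailing newline. A minimal sketch of one way to feed it the sample data and trim the edges follows; the sed step and the variable names input and result are additions for illustration, not part of the command above.

```shell
# Sample input from the question, held in a shell variable (illustrative name).
input='a,b;c^d"e}
f;g,h!;i8j-'

# tr -c complements the set (everything that is not a letter) and -s squeezes
# each run of resulting commas to one; sed then trims a comma at either end.
result=$(printf '%s\n' "$input" | tr -sc '[:alpha:]' ',' | sed 's/^,//; s/,$//')

printf '%s\n' "$result"
```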

Using gnu-sed, replace 1 or more characters other than a-z with a comma. Then remove all leading and trailing commas:

sed -Ez 's/[^a-z]+/,/g; s/^,+|,+$//g' file

Output

a,b,c,d,e,f,g,h,i,j

If ed is available/acceptable.

The script.ed:

%s/[^a-z]/ /g
%s/[[:blank:]]\{1,\}/,/g
g/./;j\
s/,$//
,p
Q

Now run

ed -s file.txt < script.ed

echo "${input_data}" |
  mawk 'NF-=_==$NF' FS='[^[:alpha:]]*' OFS=, RS=
a,b,c,d,e,f,g,h,i,j

If there is a possibility of leading-edge separators, use this instead:

echo ']a[' |
  gawk 'gsub("^,|,$",_,$!(NF=NF))^_' FS='[^[:alpha:]]*' OFS=, RS=
a

If you are OK with a Perl solution, here is a one-liner:

perl -ne '$_ =~ s/[^[:alnum:]]//g; print join(",", split//, $_)'

which outputs:

a,b,c,d,ef,g,h,i,8,j

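Simply put, this substitutes every character that is not alphanumeric with nothing. Note that the digit 8 therefore survives, and because the newline is stripped too, e and f end up joined without a comma.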
