Create a parser script with a delimiter
I am trying to convert this input from file.txt
a,b;c^d"e}
f;g,h!;i8j-
into this output
a,b,c,d,e,f,g,h,i,j
with awk.
The best I have managed so far is
awk '$1=$1' FS="[;,^}8-]" OFS="." file.txt
How can I make awk interpret " as a special character? Adding " doesn't work.
Also, how can I avoid the repeated ,, in the output and delete the last , ?
One in awk (not for all awks: tested successfully in gawk, mawk, busybox awk and macOS awk version 20200816, unsuccessfully in Debian's awk version 20121220, aka original-awk; there are limitations with locales as well):
$ awk -v RS="^$" '{ # read whole file in
gsub(/[^a-z]+/,",") # replace all non lowercase alphabet substrings with a comma
sub(/,$/,"") # remove trailing comma
}1' file # output
Output:
a,b,c,d,e,f,g,h,i,j
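For awks where RS="^$" does not read the whole file, one possible workaround is to collect the lines yourself and do the substitution in END. This is only a sketch: slurp_and_replace is a hypothetical helper name, and it assumes the input is small enough to hold in memory.

```shell
# slurp_and_replace: hypothetical helper, not part of the answer above.
# Reads the named files (or stdin) and prints the lowercase letters
# joined by commas on a single line.
slurp_and_replace() {
    awk '
        { buf = buf $0 "\n" }           # collect each line, restoring the newline awk stripped
        END {
            gsub(/[^a-z]+/, ",", buf)   # every run of non-lowercase characters becomes one comma
            sub(/^,/, "", buf)          # drop a leading comma, if any
            sub(/,$/, "", buf)          # drop the trailing comma, if any
            print buf
        }
    ' "$@"
}

# Usage with the sample input from the question:
printf '%s\n' 'a,b;c^d"e}' 'f;g,h!;i8j-' | slurp_and_replace
```

It uses only sub, gsub and END, so it does not depend on how a given awk treats a regular expression in RS.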
Using any POSIX awk and assuming you want any non-alphabetic character to act as a field separator:
$ awk -F '[^[:alpha:]]+' -v OFS=',' '{printf "%s", p; $1=$1; p=$0} END{sub(OFS"$","",p); print p}' file
a,b,c,d,e,f,g,h,i,j
If you really do just want to use the specific set of characters in your question as the field separators, then just change [^[:alpha:]]+ to [;;^}8"-]+ in the command above.
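To make the two variants easier to compare, here is a minimal sketch that takes the separator regex as a parameter. The name join_fields and the explicit separator sets are assumptions for illustration: [^A-Za-z]+ presumes ASCII-only data, and [,;^}!8"-]+ is one guess at a set that covers every separator in the sample input. Inside a bracket expression the " needs no escaping as long as the shell argument is single-quoted.

```shell
# join_fields: hypothetical helper, not part of the answer above.
# $1 is the field-separator regex (an ERE); any further arguments are
# input files. With no files it reads stdin.
join_fields() {
    sep=$1
    shift
    awk -F "$sep" -v OFS=',' '
        {
            for (i = 1; i <= NF; i++)       # walk every field on the line
                if ($i != "")               # skip empty fields left by leading/trailing separators
                    out = out (out == "" ? "" : OFS) $i
        }
        END { print out }                   # one output line, no trailing comma
    ' "$@"
}

# Any non-letter separates fields (ASCII assumption):
printf '%s\n' 'a,b;c^d"e}' 'f;g,h!;i8j-' | join_fields '[^A-Za-z]+'

# Only an explicit set of characters separates fields:
printf '%s\n' 'a,b;c^d"e}' 'f;g,h!;i8j-' | join_fields '[,;^}!8"-]+'
```

Building the result in a variable instead of rewriting $0 avoids having to strip a trailing separator afterwards, at the cost of holding the whole result in memory.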
I would harness GNU AWK for this task in the following way. Let the content of file.txt be
a,b;c^d"e} f;g,h!;i8j-
then
awk 'BEGIN{FPAT="[a-z]";OFS=","}{$1=$1;print}' file.txt
gives output
a,b,c,d,e,f,g,h,i,j
Explanation: I inform GNU AWK, using FPAT, that a field is a single lowercase ASCII letter, and that the output field separator (OFS) is , (a comma). Then for each line I do $1=$1 to trigger a rebuild of the line and print it.
(tested in GNU Awk 5.0.1)
If you only want to replace non-letter characters with commas and squeeze repeated commas, tr is your friend:
tr -sc '[:alpha:]' ','
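Because the trailing - and the final newline are also non-letters, the tr command on its own leaves one comma at the end (and one at the start if the input begins with a non-letter). A possible way to trim them with plain shell parameter expansion is sketched below; letters_csv is a hypothetical name, not part of the answer.

```shell
# letters_csv: hypothetical helper, not part of the answer above.
# Reads stdin and prints the letters separated by single commas.
letters_csv() {
    out=$(tr -sc '[:alpha:]' ',')   # non-letters become commas, repeated commas are squeezed
    out=${out#,}                    # remove one leading comma, if present
    out=${out%,}                    # remove one trailing comma, if present
    printf '%s\n' "$out"
}

# Usage with the sample input from the question:
printf '%s\n' 'a,b;c^d"e}' 'f;g,h!;i8j-' | letters_csv
```

Since tr -s already squeezes runs, there can be at most one comma at each end, so a single # and % expansion is enough.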
Using gnu-sed, replace one or more characters other than a-z with a comma. Then remove all leading and trailing commas.
sed -Ez 's/[^a-z]+/,/g; s/^,+|,+$//g' file
Output:
a,b,c,d,e,f,g,h,i,j
If ed is available/acceptable.
The script, script.ed:
%s/[^a-z]/ /g
%s/[[:blank:]]\{1,\}/,/g
g/./;j\
s/,$//
,p
Q
Now run
ed -s file.txt < script.ed
echo "${input_data}" |
mawk 'NF-=_==$NF' FS='[^[:alpha:]]*' OFS=, RS=
a,b,c,d,e,f,g,h,i,j
If there is a possibility of separators at the leading edge, use this instead:
echo ']a[' |
gawk 'gsub("^,|,$",_,$!(NF=NF))^_' FS='[^[:alpha:]]*' OFS=, RS=
a
If you are OK with a Perl solution, here is a one-liner:
perl -ne '$_ =~ s/[^[:alnum:]]//g; print join(",", split//, $_)'
which outputs:
a,b,c,d,ef,g,h,i,8,j
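Simply put, you are substituting characters that are not alphanumeric with nothing, then joining the remaining characters with commas. Note that this keeps the digit 8 and, because each line is processed separately and its newline is removed, prints the two lines back to back with no comma between e and f, so the result does not exactly match the requested output.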