Pipe-delimited file with empty entries; convert to tab-delimited with '<empty>' between
I have been given a pipe-delimited text file that contains filenames and some indexed information from each file. My goal is to make this a tab-delimited file. However, I want to know where the empty entries are. For example, lorem||dolor should become lorem'\t'<empty>'\t'dolor.
Let me give another couple of examples of what I've been given and what is desired:
Example with multiple lines (NB: there is the same number of entries on each line):
Given:
||dolor|sit
amet,||adipiscing|
sed|do|eiusmod|tempor
Desired:
<empty> '\t' <empty> '\t' dolor '\t' sit '\n'
amet, '\t' <empty> '\t' adipiscing '\t' <empty> '\n'
sed '\t' do '\t' eiusmod '\t' tempor '\n'
Empty entries at the beginning and end.
Given:
|ut|labore||dolore||
Desired:
<empty> '\t' ut '\t' labore '\t' <empty> '\t' dolore '\t' <empty> '\t' <empty>
(I don't want the spaces; I just thought they would make the desired format easier to read.)
The problem comes with consecutive empty entries. The files I've been given can have from 1 to 36 consecutive pipes (0 to 37 consecutive empty entries).
Clarification
The solution doesn't have to be sed, awk, grep, tr, etc. Those are just the solutions I've looked at. A perl or python script (or any other idea I haven't thought of) would be welcome as well.
For the attempts I made before and during my research, the commands and their output are included as an image and a text file, so as not to over-clutter the question.
Links to things I looked up: finding consecutive pipes with sed (and replacing any such series of pipes): ref. here; counting the number of empty fields (possibly useful in knowing how many <empty>'s are needed): ref. here; longest sequence: ref. here.
$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
$ bash --version
GNU bash, version 4.3.42(4)-release (x86_64-unknown-cygwin) ...
$
I'm running this version of Cygwin on Windows 10 (because the job requires it).
I was unclear on what exactly was desired.
Here's a short example showing what I would like with pipes at the beginning and end:
(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because each > only shows up after you hit enter on the previous line.)
$ cat > myfile.txt<<EOF
> ||foo|||bar||
> EOF
$ <**command-to-be-used**> myfile.txt | cat -A
<empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$
Here, ^I is how cat -A shows a '\t'. From the answers given using some example text I gave, I realized that I would like an <empty> at the end, after labore (see the command below). Note that the answers received (thanks @Neil_McGuigan and @Ed_Morton) DO give a '\t' after labore, just not an <empty>. This is my fault, as I was not clear enough in my original description. My apologies.
I was able to accomplish my goal with a little tweaking of @Neil_McGuigan's command. Note that the awk program can be typed line by line as shown, because its line breaks fall inside the single quotes; a line break anywhere else would need a space and a \ at the end of the line.
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" |
awk '
{
    $1=$1;
    for(i=1; i<=NF; i++)
    {
        if($i=="") {$i="<empty>"; n_empty++;}
    };
    print
}
END {print n_empty+0 " entries are empty" | "cat 1>&2";}
' FS='|' OFS=$'\t' | cat -A
gives the result:
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty
Once again, for those who don't want to scroll, this output is as follows:
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty
(Note that writing the count of empty entries to stderr was not necessary, but it is nice.)
Sorry for not being clear about what I wanted.
Thanks to @Neil_McGuigan and @Ed_Morton, I was able to get the solution I was searching for. My final command was as follows:
$ awk '{$1=$1; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty+0 " entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt
$
Just in case you don't want to scroll, here is a shorter variant that produces the same file without the count (it assumes a sed, such as GNU sed, that understands \t):
$ awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' \
FS='|' OFS=$'\t' file_pipe-delim.txt | sed 's/\t$/\t<empty>/g' > \
file_tab-delim.txt
$
Here's an example where the file is made, converted, and saved:
(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because each > only shows up after you hit enter on the previous line.)
$ cat > file_pipe-delim.txt<<EOF
> ||dolor|sit
> amet,||adipiscing|
> sed|do|eiusmod|tempor
> |||
> |aliqua.|Ut|
> EOF
$ awk '{$1=$1; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty+0 " entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt
$ cat -A file_tab-delim.txt
<empty>^I<empty>^Idolor^Isit$
amet,^I<empty>^Iadipiscing^I<empty>$
sed^Ido^Ieiusmod^Itempor$
<empty>^I<empty>^I<empty>^I<empty>$
<empty>^Ialiqua.^IUt^I<empty>$
$
Finally, let's return to the string that gave me trouble. We can get the desired output as follows:
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty+0 " entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' | cat -A
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty
Now, the same command without the pipe to cat -A, meaning that we won't see the ^I for each '\t'; we will just see the text as it is "tabbed."
$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | \
awk '{$1=$1; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty+0 " entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t'
<empty> <empty> lorem ipsum <empty> sit amet, <empty> <empty> <empty> eiusmod tempor <empty> <empty> labore <empty>
9 entries are empty
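For repeated use, the command above can be wrapped in a small shell function. This is only a sketch under a few assumptions: the name pipes_to_tabs is made up for illustration, and the counter is accumulated over all lines of the input rather than per line.

```shell
# pipes_to_tabs is a hypothetical helper name, not part of any tool.
# It reads pipe-delimited text from the given files (or from stdin),
# writes tab-delimited text to stdout, and reports the number of
# empty entries in the whole input on stderr.
pipes_to_tabs() {
    awk -F'|' -v OFS='\t' '
    {
        $1 = $1                      # make awk rebuild the record with OFS
        for (i = 1; i <= NF; i++)    # <= so that the last field is checked too
            if ($i == "") { $i = "<empty>"; n_empty++ }
        print
    }
    END { print n_empty + 0 " entries are empty" | "cat 1>&2" }
    ' "$@"
}
```

It would be called as pipes_to_tabs file_pipe-delim.txt > file_tab-delim.txt, or at the end of a pipeline with no file argument.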
awk '
{
    $1=$1;
    # i<NF stops before the last field; use i<=NF to also mark an empty last field
    for(i=1; i<NF; i++) {
        if($i=="") { $i="<empty>"; empty++ }
    };
    print
}
END { print empty" empty" | "cat 1>&2"; }
' FS='|' OFS=$'\t'
Should do the trick. $1=$1 tells awk to "rebuild" the record, so that the fields are joined with the new output field separator (OFS).
print empty" empty" | "cat 1>&2" prints "n empty" to stderr. You can omit it if you like.
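To see why the $1=$1 assignment matters, here is a small sketch comparing the output with and without it. The sample input a|b|c and the two function names are assumptions chosen for illustration.

```shell
# Hypothetical helper names, for illustration only.

# Without an assignment to a field, awk prints the record unchanged,
# so the pipes survive even though OFS is a tab.
without_rebuild() { awk -F'|' -v OFS='\t' '{ print }'; }

# Assigning any field (even to itself) makes awk rebuild $0 from the
# fields, joined by OFS.
with_rebuild() { awk -F'|' -v OFS='\t' '{ $1 = $1; print }'; }

printf 'a|b|c\n' | without_rebuild   # the pipes are still there
printf 'a|b|c\n' | with_rebuild      # the fields are now separated by tabs
```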
You only need to do the || -> |<empty>| substitution twice, no matter how many times that pattern appears, as long as you do it globally each time. Matches cannot overlap, so the first pass fills every other empty field and the second pass fills the ones in between:
$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' file
lorem ipsum <empty> sit amet, <empty> <empty> <empty> eiusmod tempor <empty> <empty> labore
or if you prefer awk:
$ awk '{while(gsub(/\|\|/,"|<empty>|")); gsub(/\|/,"\t")} 1' file
lorem ipsum <empty> sit amet, <empty> <empty> <empty> eiusmod tempor <empty> <empty> labore
With some seds you might need '$'\t'' instead of just \t.
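The two-pass idea can be checked on a run of four pipes, and the same approach can be extended to the empty entries at the start and end of a line that the question asks for. This is a sketch and not part of the answer above: the function names are made up, and tr is used for the final step so that no sed-specific \t is needed.

```shell
# Hypothetical helper names, for illustration only.

# One global pass: matches cannot overlap, so only every other gap is filled.
one_pass() { sed 's/||/|<empty>|/g'; }

# Two global passes fill the gaps that the first pass had to skip.
two_pass() { sed 's/||/|<empty>|/g; s/||/|<empty>|/g'; }

# Also mark an empty first and last entry, then turn the pipes into tabs.
pipes_to_tabs_sed() {
    sed 's/^|/<empty>|/; s/|$/|<empty>/; s/||/|<empty>|/g; s/||/|<empty>|/g' |
        tr '|' '\t'
}

printf 'a||||b\n' | one_pass    # -> a|<empty>||<empty>|b
printf 'a||||b\n' | two_pass    # -> a|<empty>|<empty>|<empty>|b
printf '||foo|||bar||\n' | pipes_to_tabs_sed
```

The last call prints eight tab-separated entries, matching the ||foo|||bar|| example in the question.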