Pipe-delimited file with empty entries; convert to tab-delimited with "<empty>" in between

Question

Problem

I have been given a pipe-delimited text file that contains filenames and some indexed information from each file. My goal is to make this a tab-delimited file. However, I want to know where the empty entries are. For example, lorem||dolor should become lorem '\t' <empty> '\t' dolor.

Let me give another couple of examples of what I've been given and what is desired:

Example with multiple lines: (NB There are the same number of entries on each line.)

Given:

||dolor|sit
amet,||adipiscing|
sed|do|eiusmod|tempor

Desired:

<empty> '\t' <empty> '\t' dolor '\t' sit '\n'
amet, '\t' <empty> '\t' adipiscing '\t' <empty> '\n'
sed '\t' do '\t' eiusmod '\t' tempor '\n'

Empty entries at the beginning and end.

Given:

|ut|labore||dolore||

Desired:

<empty> '\t' ut '\t' labore '\t' <empty> '\t' dolore '\t' <empty> '\t' <empty>

(I don't want the spaces; I just thought they would make the desired format easier to read.)

The problem comes with consecutive empty entries. The files I've been given can have from 1 to 36 consecutive pipes (0 to 37 consecutive empty entries.)

Clarification

The solution doesn't have to be sed, awk, grep, tr, etc. Those are just the solutions I've looked at. A perl or python script (or any other idea I haven't thought of) would be welcome as well.

My attempts and research

For the attempts I made before and during my research, the commands and their output are included as an image ¹ and a text file ² so as to not over-clutter the question.

My Attempts image

My Attempts text

Links to things I looked up: finding consecutive pipes with sed (and replacing any such series of pipes): ref. here; counting the number of empty fields (possibly useful in knowing how many <empty>'s are needed): ref. here; longest sequence: ref. here.

System information

$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
$ bash --version
GNU bash, version 4.3.42(4)-release (x86_64-unknown-cygwin) ...
$

I'm running this version of Cygwin on Windows 10 (because the job requires it.)

Edit1

I was unclear on what exactly was desired.

Here's a short example showing what I would like with pipes at the beginning and end:

(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because the > prompts only show up after you hit enter on the previous line.)

$ cat > myfile.txt<<EOF
> ||foo|||bar||
> EOF

$ <**command-to-be-used**> myfile.txt | cat -A
<empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$

Where ^I is how cat -A shows a '\t'. From the answers given using some example text I gave, I realized that I would like an <empty> at the end, after labore (see the command below). Note that the answers received (thanks @Neil_McGuigan and @Ed_Morton) DO give a '\t' after labore, just not an <empty>. This is my fault, as I was not clear enough in my original description. My apologies.

I was able to accomplish my goal with a little tweaking of @Neil_McGuigan's command. Note that, if you want to type this "line-by-line" as shown, you'll need to include a space and a \ at the end of each line.

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | 
  awk '
       {
         $1=$1; n_empty=0; 
         for(i=1; i<=NF; i++) 
         { 
           if($i=="") {$i="<empty>"; n_empty++;}
         }; 
         print
       }
       END {print n_empty" entries are empty" | "cat 1>&2";}
      ' FS='|' OFS=$'\t'
   | cat -A

gives the result:

<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

(Note that writing the count of empty entries to stderr was not necessary, but it is nice.)

Sorry for not being clear about what I wanted.

What I Used Successfully

Thanks to @Neil_McGuigan and @Ed_Morton, I was able to get the solution for which I was searching. My final command was as follows:

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt

$

Just in case you don't want to scroll, here is a command that produces the same converted file (without the count), wrapped across lines:

$ awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' \
  FS='|' OFS=$'\t' file_pipe-delim.txt | sed 's/\t$/\t<empty>/g' > \
  file_tab-delim.txt

$

Here's an example where the file is made, converted, and saved:

(This is what you'll see and need to type if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copy/pasted, because the > prompts only show up after you hit enter on the previous line.)

$ cat > file_pipe-delim.txt<<EOF
> ||dolor|sit
> amet,||adipiscing|
> sed|do|eiusmod|tempor
> |||
> |aliqua.|Ut|
> EOF

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++)
{if($i=="") {$i="<empty>"; n_empty++;}}; print;}
END {print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt


$ cat -A file_tab-delim.txt
<empty>^I<empty>^Idolor^Isit$
amet,^I<empty>^Iadipiscing^I<empty>$
sed^Ido^Ieiusmod^Itempor$
<empty>^I<empty>^I<empty>^I<empty>$
<empty>^Ialiqua.^IUt^I<empty>$

$
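
One thing to watch in the command above: n_empty=0 sits inside the main block, so the counter is reset on every line and the END message reports only the empties of the last line. A minimal sketch of a whole-file count follows. The names pipes_to_tabs and total are made up for this illustration, and -v OFS='\t' is used in place of OFS=$'\t' on the assumption that the shell may not be bash.

```shell
# Hypothetical helper: convert pipe-delimited input to tab-delimited output,
# mark empty fields, and report the count for the whole input on stderr.
pipes_to_tabs() {
  awk -F'|' -v OFS='\t' '
    BEGIN { total = 0 }            # initialised once, never reset per line
    {
      $1 = $1                      # make awk rebuild the record with OFS
      for (i = 1; i <= NF; i++)    # <= so that the last field is checked too
        if ($i == "") { $i = "<empty>"; total++ }
      print
    }
    END { print total " entries are empty" | "cat 1>&2" }
  ' "$@"
}

# Same five lines as the sample file above, fed through stdin.
printf '%s\n' '||dolor|sit' 'amet,||adipiscing|' 'sed|do|eiusmod|tempor' '|||' '|aliqua.|Ut|' |
  pipes_to_tabs
```

For the five-line sample this should report 10 empty entries in total, whereas the per-line reset reports only the 2 from the last line.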

Finally, let's return to the string that gave me trouble. We can get the desired output as follows:

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' | cat -A
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

Now, the same command without the pipe to cat -A, meaning that we won't see the ^I for each '\t'; we will just see the text as it is "tabbed."

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | \
awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t'

<empty> <empty> lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor  <empty> <empty> labore  <empty>
9 entries are empty

Answer 1

awk '
     {
       $1=$1; 
       for(i=1; i<NF; i++) { 
         if($i=="") { $i="<empty>"; empty++ }
       }; 
       print
     }
     END { print empty" empty" | "cat 1>&2"; }
' FS='|' OFS=$'\t'

Should do the trick. $1=$1 tells awk to "rebuild" the record so that its fields are joined with the new output field separator (OFS).

print empty" empty" | "cat 1>&2" prints "n empty" to stderr. You can omit it if you like.
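
A small sketch of the two points above, under stated assumptions: the sample line is made up, mark_empty is a hypothetical helper, and its argument only selects the loop bound. It shows that OFS takes effect only after a field assignment, and that a loop running to i<NF leaves a trailing empty field unmarked while i<=NF marks it.

```shell
line='||lorem|ipsum||sit|'

# No field is assigned, so awk prints the record unchanged and OFS is ignored.
printf '%s\n' "$line" | awk -F'|' -v OFS='\t' '{ print }'

# Assigning any field (even to itself) makes awk rebuild $0 using OFS.
printf '%s\n' "$line" | awk -F'|' -v OFS='\t' '{ $1 = $1; print }'

# Hypothetical helper: mark_empty 0 loops over fields 1..NF-1 (i < NF),
# mark_empty 1 loops over fields 1..NF (i <= NF).
mark_empty() {
  awk -F'|' -v OFS='\t' -v last="$1" '{
    $1 = $1
    n = last ? NF : NF - 1
    for (i = 1; i <= n; i++) if ($i == "") $i = "<empty>"
    print
  }'
}

printf '%s\n' "$line" | mark_empty 0   # the line ends with a bare tab
printf '%s\n' "$line" | mark_empty 1   # the line ends with <empty>
```

Piping either result through cat -A, as in the question, makes the trailing tab visible.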

Answer 2

You only need to do the || -> |<empty>| substitution twice, no matter how many times that pattern appears, as long as you do it globally each time:

$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor  <empty> <empty> labore

or if you prefer awk:

$ awk '{while(gsub(/\|\|/,"|<empty>|")); gsub(/\|/,"\t")} 1' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor  <empty> <empty> labore

With some seds you might need '$'\t'' instead of just \t.
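
Why two passes are enough can be seen on a short made-up line: a global substitution never matches overlapping text, so the first pass leaves a bare || between its own replacements and the second pass fills it. The sketch below also anchors the first and last pipe, since a leading or trailing empty entry is not between two pipes. fill_all is a hypothetical helper, and tr is used for the tab conversion on the assumption that \t in a sed replacement may not be supported.

```shell
# Four consecutive pipes, i.e. three empty entries between a and b.
line='a||||b'

# One global pass leaves a bare || in the middle.
printf '%s\n' "$line" | sed 's/||/|<empty>|/g'
# → a|<empty>||<empty>|b

# A second global pass fills the remaining gap.
printf '%s\n' "$line" | sed 's/||/|<empty>|/g; s/||/|<empty>|/g'
# → a|<empty>|<empty>|<empty>|b

# Hypothetical helper: also mark an empty first and last entry,
# then run the two global passes.
fill_all() {
  sed 's/^|/<empty>|/; s/|$/|<empty>/; s/||/|<empty>|/g; s/||/|<empty>|/g'
}

# Convert the remaining pipes to tabs with tr instead of sed.
printf '%s\n' '||foo|||bar||' | fill_all | tr '|' '\t'
```

This relies on the data itself never containing a pipe or the literal text <empty>.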

2 answers

Answer 1
2 Accepted 2016-08-10 17:51:29

Answer 2
1 2016-08-10 20:19:51
