Pipe-delimited file with empty entries; convert to tab-delimited with '<empty>' between

Problem

I have been given a pipe-delimited text file that contains filenames and some indexed information from each file. My goal is to make this a tab-delimited file. However, I want to know where the empty entries are, so each one should be marked: e.g. lorem||dolor becomes lorem '\t' <empty> '\t' dolor.

Let me give a couple more examples of what I've been given and what is desired:

Example with multiple lines: (NB: there are the same number of entries on each line.)

Given:

||dolor|sit
amet,||adipiscing|
sed|do|eiusmod|tempor

Desired:

<empty> '\t' <empty> '\t' dolor '\t' sit '\n'
amet, '\t' <empty> '\t' adipiscing '\t' <empty> '\n'
sed '\t' do '\t' eiusmod '\t' tempor '\n'
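
The note above says that every line has the same number of entries. A minimal sketch for checking that on a real file, assuming awk and sort are available (field_counts is a hypothetical helper name, not something from the question):

```shell
# Hypothetical helper: print the distinct numbers of pipe-separated
# fields seen on the lines read from stdin.
field_counts() {
  awk -F'|' '{ print NF }' | sort -nu
}

# The three example lines each have four entries.
printf '%s\n' '||dolor|sit' 'amet,||adipiscing|' 'sed|do|eiusmod|tempor' | field_counts
# prints 4
```

If this prints a single number, every line has that many entries; more than one number means the lines are ragged.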

Example with empty entries at the beginning and end:

Given:

|ut|labore||dolore||

Desired:

<empty> '\t' ut '\t' labore '\t' <empty> '\t' dolore '\t' <empty> '\t' <empty>

(I don't want the spaces; I just thought they would make the desired format easier to read.)

The problem comes with consecutive empty entries. The files I've been given can have from 1 to 36 consecutive pipes (0 to 37 consecutive empty entries).
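
To see how long the runs of pipes actually get in a given file, something like the following could be used. It is only a sketch: longest_pipe_run is a hypothetical helper name, and it assumes a POSIX awk.

```shell
# Hypothetical helper: report the longest run of consecutive '|'
# characters found on any line read from stdin.
longest_pipe_run() {
  awk '{
         run = 0
         for (i = 1; i <= length($0); i++) {
           if (substr($0, i, 1) == "|") {
             run++
             if (run > max) max = run
           } else {
             run = 0
           }
         }
       }
       END { print max + 0 }'
}

printf '%s\n' '||dolor|sit' '|ut|labore||dolore||' | longest_pipe_run
# prints 2
```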

Clarification

The solution doesn't have to use sed, awk, grep, tr, etc. Those are just the tools I've looked at. A perl or python script (or any other idea I haven't thought of) would be welcome as well.

My attempts and research

For the attempts I made before and during my research, the commands and their output are included as an image (1) and a text file (2) so as not to clutter the question.

My Attempts image

My Attempts text

Links to things I looked up. Finding consecutive pipes with sed (and replacing any such series of pipes): ref. here; counting the number of empty fields (possibly useful in knowing how many <empty>'s are needed): ref. here; longest sequence: ref. here.
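
For the second item, counting the empty fields on each line, a minimal awk sketch might look like this (count_empty_fields is a hypothetical name; this is not the code from the linked reference):

```shell
# Hypothetical helper: for each line read from stdin, print how many of
# its pipe-separated fields are empty.
count_empty_fields() {
  awk -F'|' '{
               n = 0
               for (i = 1; i <= NF; i++) if ($i == "") n++
               print n
             }'
}

printf '%s\n' '||dolor|sit' 'sed|do|eiusmod|tempor' '|||' | count_empty_fields
# prints 2, 0 and 4 on separate lines
```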

System information

$ uname -a
CYGWIN_NT-10.0 A-1052207 2.5.2(0.297/5/3) 2016-06-23 14:29 x86_64 Cygwin
$ bash --version
GNU bash, version 4.3.42(4)-release (x86_64-unknown-cygwin) ...
$

I'm running this version of Cygwin on Windows 10 (because the job requires it).


Edit1

I was unclear about what exactly was desired.

Here's a short example showing what I would like with pipes at the beginning and end:

(This is what you'll see if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copied and pasted as shown, because the > only shows up after you hit enter on the previous line.)

$ cat > myfile.txt<<EOF
> ||foo|||bar||
> EOF

$ <**command-to-be-used**> myfile.txt | cat -A
<empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$

Here ^I is how cat -A shows a '\t'. From the answers given using some example text I gave, I realized that I would like an <empty> at the end, after labore (see the command below). Note that the answers received (thanks @Neil_McGuigan and @Ed_Morton) DO give a '\t' after labore, just not an <empty>. This is my fault, as I was not clear enough in my original description. My apologies.

I was able to accomplish my goal with a little tweaking of @Neil_McGuigan's command. Note that, if you want to type this line by line as shown, you'll need to include a space and a \ at the end of each line.

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | 
  awk '
       {
         $1=$1; n_empty=0; 
         for(i=1; i<=NF; i++) 
         { 
           if($i=="") {$i="<empty>"; n_empty++;}
         }; 
         print
       }
       END {print n_empty" entries are empty" | "cat 1>&2";}
      ' FS='|' OFS=$'\t'
   | cat -A

gives the result:

<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

Once again, for those who don't want to scroll, this output (wrapped here) is as follows:

<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>
^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

(Note that writing the count of empty entries to stderr was not necessary, but it is nice. Because n_empty is reset to 0 on every input line, the count printed in END covers only the last line of a multi-line file.)
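
In the command above, n_empty=0 runs for every input line, so on a file with several lines the END block reports the count for the last line only. If a total for the whole file is wanted, one option is to never reset the counter. This is a sketch under that assumption, not the command from the answers; convert_and_count is a hypothetical name.

```shell
# Sketch: same conversion as above, but the counter is never reset, so
# END reports the total number of empty entries over all lines.
convert_and_count() {
  awk 'BEGIN { FS = "|"; OFS = "\t" }
       {
         $1 = $1                      # rebuild the record using OFS
         for (i = 1; i <= NF; i++)
           if ($i == "") { $i = "<empty>"; n_empty++ }
         print
       }
       END { printf "%d entries are empty\n", n_empty | "cat 1>&2" }'
}

# Two lines with 2 and 4 empty entries: stderr reports 6.
printf '%s\n' '||dolor|sit' '|||' | convert_and_count
```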

Sorry for not being clear about what I wanted.


What I Used Successfully

Thanks to @Neil_McGuigan and @Ed_Morton, I was able to get the solution I was looking for. My final command was as follows:

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt

$

Just in case you don't want to scroll, here is a command that produces the same output file, wrapped across lines (it leaves out the count and uses sed to fill in a trailing empty entry):

$ awk '{$1=$1; for(i=1; i<NF; i++){ if($(i)=="")$(i)="<empty>" }; print}' \
  FS='|' OFS=$'\t' file_pipe-delim.txt | sed 's/\t$/\t<empty>/g' > \
  file_tab-delim.txt

$

Here's an example where the file is made, converted, and saved:

(This is what you'll see if you type the first line, hit enter, type the second line, hit enter, etc. It can't be copied and pasted as shown, because the > only shows up after you hit enter on the previous line.)

$ cat > file_pipe-delim.txt<<EOF
> ||dolor|sit
> amet,||adipiscing|
> sed|do|eiusmod|tempor
> |||
> |aliqua.|Ut|
> EOF

$ awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++)
{if($i=="") {$i="<empty>"; n_empty++;}}; print;}
END {print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t' file_pipe-delim.txt > file_tab-delim.txt


$ cat -A file_tab-delim.txt
<empty>^I<empty>^Idolor^Isit$
amet,^I<empty>^Iadipiscing^I<empty>$
sed^Ido^Ieiusmod^Itempor$
<empty>^I<empty>^I<empty>^I<empty>$
<empty>^Ialiqua.^IUt^I<empty>$

$

Finally, let's return to the string that gave me trouble. We can get the desired output as follows:

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) {if($i=="") {$i="<empty>"; n_empty++;}}; print;} END {print n_empty" entries are empty" | "cat 1>&2";}' FS='|' OFS=$'\t' | cat -A
<empty>^I<empty>^Ilorem^Iipsum^I<empty>^Isit^Iamet,^I<empty>^I<empty>^I<empty>^Ieiusmod^Itempor^I<empty>^I<empty>^Ilabore^I<empty>$
9 entries are empty

Now, the same command without the pipe to cat -A, meaning that we won't see the ^I for each '\t'; we will just see the text as it is "tabbed."

$ echo "||lorem|ipsum||sit|amet,||||eiusmod|tempor|||labore|" | \
awk '{$1=$1; n_empty=0; for(i=1; i<=NF; i++) \
{if($i=="") {$i="<empty>"; n_empty++;}}; print;} END \
{print n_empty" entries are empty" | "cat 1>&2";}' \
FS='|' OFS=$'\t'

<empty> <empty> lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor  <empty> <empty> labore  <empty>
9 entries are empty

awk '
     {
       $1=$1; 
       for(i=1; i<NF; i++) { 
         if($i=="") { $i="<empty>"; empty++ }
       }; 
       print
     }
     END { print empty" empty" | "cat 1>&2"; }
' FS='|' OFS=$'\t'

Should do the trick. $1=$1 tells awk to "rebuild" the record from its fields, so that it is printed with the new output field separator (OFS).
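
The effect of $1=$1 is easiest to see side by side. A small illustration follows; the two function names are made up for this example, and cat -A is used as in the question to make tabs visible as ^I.

```shell
# Without an assignment to a field, awk prints $0 exactly as it was read,
# so OFS is never used.
without_rebuild() { awk -F'|' -v OFS='\t' '{ print }'; }

# Assigning to any field makes awk rebuild $0, joining the fields with OFS.
with_rebuild() { awk -F'|' -v OFS='\t' '{ $1 = $1; print }'; }

printf 'lorem||dolor\n' | without_rebuild | cat -A   # lorem||dolor$
printf 'lorem||dolor\n' | with_rebuild | cat -A      # lorem^I^Idolor$
```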

print empty" empty" | "cat 1>&2" prints "n empty" to stderr. You can omit it if you like.

You only need to do the || -> |<empty>| substitution twice, no matter how many times that pattern appears, as long as you do it globally each time:

$ sed 's/||/|<empty>|/g; s/||/|<empty>|/g; s/|/\t/g' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor <empty>  <empty> labore

or if you prefer awk: 或者,如果您更喜欢awk:

$ awk '{while(gsub(/\|\|/,"|<empty>|")); gsub(/\|/,"\t")} 1' file
lorem   ipsum   <empty> sit     amet,   <empty> <empty> <empty> eiusmod tempor <empty>  <empty> labore
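
Why twice? A global substitution scans left to right and does not reuse characters it has already matched, so in a run of three pipes the first two are replaced and the third is left next to the pipe that ends the replacement. After one pass no run is longer than two pipes, which the second pass then clears. A small demonstration, with function names made up for the example:

```shell
one_pass()   { sed 's/||/|<empty>|/g'; }
two_passes() { sed 's/||/|<empty>|/g; s/||/|<empty>|/g'; }

printf 'a|||b\n' | one_pass      # a|<empty>||b
printf 'a|||b\n' | two_passes    # a|<empty>|<empty>|b
```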

With some seds you might need '$'\t'' instead of just \t.
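
Another way around the \t question is to put a literal tab into a shell variable with printf and splice it into the sed script. The sketch below does that, and also adds two anchored substitutions for empty entries at the very start and end of a line. Those two are an addition that is not part of the sed command above, and pipes_to_tabs_portable is a hypothetical name.

```shell
# Sketch: mark empty entries (including leading and trailing ones) and
# convert pipes to tabs without relying on sed understanding \t.
pipes_to_tabs_portable() {
  tab=$(printf '\t')   # a literal tab character
  sed "s/^|/<empty>|/; s/|\$/|<empty>/; s/||/|<empty>|/g; s/||/|<empty>|/g; s/|/${tab}/g"
}

printf '||foo|||bar||\n' | pipes_to_tabs_portable | cat -A
# <empty>^I<empty>^Ifoo^I<empty>^I<empty>^Ibar^I<empty>^I<empty>$
```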
