Splitting text file and adding line count in header with awk in OSX

I want to do the following with my text file, which contains thousands of lines:

  • Split the file at the lines starting with B (without including those lines in the output).
  • Add a header to each split file consisting of the number of lines in that file plus additional text (i.e. <number of lines> " 120").
  • Remove the symbol that starts each line (i.e. >).

I have tried the following code, which splits the file, but the line count (NR-1 " 120") is cumulative and is printed at the very end of each split file instead of at the start.

awk '/^B/{n++; print NR-1 " 120" > filename;close(filename);next}{filename = "part" n ".txt"; print >filename}'

In my attempts to print it as a header, I used the following code, but the supposed header does not appear at all.

awk 'BEGIN{print NR-1 " 120" > filename}; /^B/{n++;close(filename);next};{filename = "part" n ".txt"; print >filename}' inputfile.txt

The above code also produces the following error:

awk: null file name in print or getline source line number 1

My text file looks something like:

>L1212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1222 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1232 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |1|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |2|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4412 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4512 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |3|

Update: As a workaround for using the script by @mklement0 without Mawk or GNU awk, I used grep in TextWrangler to change all lines starting with B to a single character, ~.

With GNU Awk or Mawk:

awk -v RS='\nB       \\*        -                     \\|[0-9]+\\|\n' 'NF {
  numLines = gsub("(^|\n)>", "\n") # replace line-initial ">" and count lines in block
  fname = "part" ++n               # determine next output filename
  printf "%s%s\n", numLines " 120", $0 > fname # output header + block
  close(fname)                               # close output file
}' file

Note: Unless the last line in the input file is a separator line, the last output file will have a trailing empty line (the data-line count in the header will be correct, however). The OP has confirmed that this is not a problem.
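To check the claim about the header count, something like the following could be used. It is only a sketch: part_demo is a hypothetical, hand-made file shaped like an output file with the trailing empty line, and header_check.txt is just an illustrative place to keep the result.

```shell
# Hypothetical output file, shaped like one produced when the input
# does not end with a separator line (note the trailing empty line).
printf '2 120\nL4212 ATCT\nL4312 ATCT\n\n' > part_demo

# Compare the count stored in the header with the number of non-empty data lines.
awk '
FNR == 1 { file = FILENAME; expected = $1; next }   # header line: "<count> 120"
NF       { actual++ }                               # count only non-empty lines
END      { print file ": header=" expected ", data lines=" actual }
' part_demo | tee header_check.txt
```

Because only non-empty lines are counted, the trailing empty line does not affect the comparison.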

  • GNU Awk or Mawk is needed, because only they support multi-character, regex-based RS (input-record separator) values, unlike the BSD awk that macOS comes with. It is possible to solve this problem differently, but it would be a little more cumbersome.

    • Both GNU Awk and Mawk can be installed on macOS via the package manager Homebrew; with Homebrew installed, simply run brew install gawk or brew install mawk.
  • The approach breaks the input into blocks of lines, delimited by the B separator lines. Thus, each such block must fit into memory as a whole (presumably two copies at once, due to the string substitution).

  • Having the whole block of lines in memory before writing it to the output file is what allows counting the lines up front and adding that information to the header.

    • numLines = gsub("(^|\n)>", "\n") both removes the line-initial > chars. and determines the number of lines in the block, taking advantage of the fact that gsub() returns the number of replacements made. Each > is replaced with a newline, so the block starts with a newline that separates the header from the first data line in the printf call.
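For the stock BSD awk on macOS, one possible alternative that avoids a regex-based RS is to buffer each block in an array and write the header once the block's size is known. The following is a minimal sketch and not part of the original answer; the sample_input.txt name, the flush_block function, and the partN.txt output names are assumptions made for illustration.

```shell
# Recreate a small sample in the question's format (illustrative data only).
cat > sample_input.txt <<'EOF'
>L1212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1222 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L1232 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |1|
>L4212 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
>L4312 ATCTATCTTCTATCTGTTAGCTAGCTAGCTA
B       *        -                     |2|
EOF

awk '
# Write the buffered block, preceded by its header, to the next output file.
function flush_block(   fname, i) {
  if (count == 0) return               # nothing buffered, e.g. two separators in a row
  fname = "part" ++n ".txt"            # hypothetical naming: part1.txt, part2.txt, ...
  print count " 120" > fname           # header: data-line count plus the fixed text
  for (i = 1; i <= count; i++) print buf[i] > fname
  close(fname)
  count = 0
}
/^B/ { flush_block(); next }           # separator line: emit the block, drop the separator
     { sub(/^>/, ""); buf[++count] = $0 }   # strip the leading ">" and buffer the line
END  { flush_block() }                 # emit a final block that has no trailing separator
' sample_input.txt
```

With the sample above, part1.txt should start with the header 3 120 followed by the three data lines without their leading >. As with the RS-based approach, each block is held in memory until its separator line is reached, but no trailing empty line is written.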
