How to read a file every n characters instead of each line using awk?

Question

This is the content of file.txt:

hello bro
my nam§
is Jhon Does

The file could also contain non-printable characters (for example \x00 or \x02), and, as you can see, the lines are not all the same length.

I want to read it 5 characters at a time, without taking line breaks into account. I thought of something like this using awk:

awk -v RS='' '{
  s=s $0;
}END{
  n=length(s);

  for(x=1; x<=n; x=x+5){
    # Here I will put some calcs and stuff

    i++;
    print "line " i ": #" substr(s,x,5) "#"
  }
}' file.txt

The output is the following:

line 1: #hello#
line 2: # bro
#
line 3: #my na#
line 4: #m§
is#
line 5: # Jhon#
line 6: # Does#

It works perfectly, but the input file will be very large, so performance is important.

In short, I'm looking for something like this:

awk -v RS='.{5}' '{ # Here I will put some calcs and stuff }'

But it doesn't work.

Another alternative that works OK:

xxd -ps mifile.txt | tr -d '\n' | fold -w 10 | awk '{print "23" $0 "230a"}' | xxd -ps -r

Do you have any ideas or alternatives? Thank you.

Answer 1

You can use perl and binmode, assuming you are using normal (single-byte) characters.

use strict;
use warnings;

# Open the file.
open my $fh, '<', 'test' or die "Cannot open test: $!";
# Set to binary mode.
binmode $fh;
# Read each record as 5 bytes.
$/ = \5;

# Read records.
while (<$fh>) {
    # Do whatever calculations you want here.
    print "$_#";
}

For extended character sets you can use UTF-8 and read 5 characters at a time instead of 5 bytes.

use strict;
use warnings;

# Open the file in UTF-8 mode.
open my $fh, '<:utf8', 'test' or die "Cannot open test: $!";
# Set STDOUT to UTF-8 as well.
binmode(STDOUT, ":utf8");

# Read 5 characters at a time into $data.
while ((read($fh, my $data, 5)) != 0) {
    # Do whatever you want with $data here.
    print "$data#";
}
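To make the difference between the two Perl snippets concrete, here is a small Python sketch of the same distinction: reading 5 bytes at a time versus 5 decoded characters at a time. The helper name `chunks` and the sample text (copied from the question, with no trailing newline) are assumptions made for illustration only.

```python
import io

def chunks(stream, n):
    """Yield successive reads of up to n units from stream until it is exhausted."""
    while True:
        piece = stream.read(n)
        if not piece:  # b'' or '' signals end of input
            return
        yield piece

# Sample text mirroring the question's file; "§" occupies two bytes in UTF-8.
raw = "hello bro\nmy nam§\nis Jhon Does".encode("utf-8")

# Binary stream: each chunk is 5 bytes, so the two bytes of "§" count separately.
byte_chunks = list(chunks(io.BytesIO(raw), 5))

# Text stream: each chunk is 5 decoded characters, whatever their byte length.
text_stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
char_chunks = list(chunks(text_stream, 5))

print(len(byte_chunks), "byte chunks;", len(char_chunks), "character chunks")
```

With this input the byte-oriented read produces one extra, shorter chunk at the end, because "§" shifts every later boundary by one byte.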

Answer 2

If you are okay with Python, you may try this:

with open('filename', 'r') as f:
    # Read up to 5 characters at a time; read() returns '' at end of file.
    w = f.read(5)
    while w != '':
        print(w)
        w = f.read(5)
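Since the question mentions very large files and non-printable bytes, a lazy variant that opens the file in binary mode may be a better fit. The following is only a sketch: `read_chunks`, the temporary demo file, and its contents are made up for illustration, and nothing here has been benchmarked.

```python
import os
import tempfile
from functools import partial

def read_chunks(path, n=5):
    """Lazily yield n-byte chunks of the file at path; the last one may be shorter."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps calling f.read(n) until it returns b"".
        yield from iter(partial(f.read, n), b"")

# Demo on a throwaway file that contains a NUL byte and line breaks.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello bro\nmy nam\x00\nis Jhon Does")
    demo_path = tmp.name

demo_chunks = list(read_chunks(demo_path))
os.remove(demo_path)

for i, chunk in enumerate(demo_chunks, start=1):
    print("line", i, chunk)
```

Because `read_chunks` is a generator, only one chunk is held in memory at a time when you loop over it directly instead of calling `list()`.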

Answer 3

So you asked how to read a file every n characters instead of each line using awk.

Solution:

If you have a modern gawk implementation, use FPAT:

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

Code:

gawk 'BEGIN{FS="\n";RS="";FPAT=".{,5}"}
            {for (i=1;i<=NF;i++){
               printf("$%d = <%s>\n", i, $i)}
            }' file

Check the demo.
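The idea of defining fields "by what they are" can be tried outside gawk as well. Below is a rough Python analogue using a regular expression. It is not a translation of the gawk command: the pattern uses `{1,5}` rather than `{,5}` so that it can never match an empty string, and the sample record is assumed from the question.

```python
import re

# Describe what a field *is*: between one and five characters of any kind.
# re.DOTALL lets "." match newlines too, so line breaks count as ordinary characters.
FIELD = re.compile(r".{1,5}", re.DOTALL)

record = "hello bro\nmy nam§\nis Jhon Does"
fields = FIELD.findall(record)

for i, field in enumerate(fields, start=1):
    print(f"${i} = <{field}>")
```

The quantifier is greedy, so every field is 5 characters long except possibly the last one.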

Answer 4

I'm not sure I understand what you want, but this outputs the same as the script in your question that you say works perfectly, so hopefully this is it:

$ awk -v RS='.{5}' 'RT!=""{ print "line", NR ": #" RT "#" }' file
line 1: #hello#
line 2: # bro
#
line 3: #my na#
line 4: #m§
is#
line 5: # Jhon#
line 6: # Does#

The above uses GNU awk for multi-char RS and RT.
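One thing worth checking with this approach: as far as I can tell, the `RT!=""` guard means that a final piece shorter than 5 characters (for which no separator matched) is not printed. A small Python sketch of the same "match exactly 5 characters" logic makes that leftover visible. The function name `full_chunks` and the sample string with two extra trailing characters are assumptions for illustration.

```python
import re

def full_chunks(text, n=5):
    """Return (chunks, tail): every complete n-character chunk, plus any shorter remainder."""
    # re.DOTALL makes "." match newlines, so line breaks are ordinary characters.
    pattern = re.compile(".{%d}" % n, re.DOTALL)
    chunks = [m.group() for m in pattern.finditer(text)]
    # Whatever follows the last complete match is the unmatched remainder.
    tail = text[len(chunks) * n:]
    return chunks, tail

chunks, tail = full_chunks("hello bro\nmy nam§\nis Jhon Does!!")

for i, chunk in enumerate(chunks, start=1):
    print(f"line {i}: #{chunk}#")
print("leftover:", repr(tail))
```

If the remainder matters for your calculations, handle it separately after the loop.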
