How can I read first n and last n lines from a file?

How can I read the first n lines and the last n lines of a file?

For n=2, I read online that (head -n2 && tail -n2) would work, but it doesn't.

$ cat x
1
2
3
4
5
$ cat x | (head -n2 && tail -n2)
1
2

The expected output for n=2 would be:

1
2
4
5

If the input is a regular file rather than a pipe, you can simply read it twice:

head -n2 file && tail -n2 file
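
The piped form in the question fails because head typically reads a whole block from the pipe, not just the two lines it prints, so nothing is left for tail; with a regular file, each command starts from the beginning on its own. Here is a minimal sketch of one workaround, assuming it is acceptable to spool the input to a temporary file first. The function name first_last is made up for illustration and does not come from the answers on this page.

```shell
# first_last N: print the first N and the last N lines of standard input.
# Hypothetical helper. It copies stdin to a temporary file so that head and
# tail can each read the data from the start.
first_last() {
    n=$1
    tmp=$(mktemp) || return 1
    cat > "$tmp"           # spool the whole pipe to disk
    head -n "$n" "$tmp"    # first N lines
    tail -n "$n" "$tmp"    # last N lines
    rm -f "$tmp"
}

# The five-line input from the question:
printf '1\n2\n3\n4\n5\n' | first_last 2
```

For the question's input this prints 1, 2, 4 and 5. When the input has fewer than 2N lines, the overlapping lines are printed twice.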

Chances are you're going to want something like:

... | awk -v OFS='\n' '{a[NR]=$0} END{print a[1], a[2], a[NR-1], a[NR]}'

Or, if you need to specify the number of lines, and taking into account @Wintermute's astute observation that you don't need to buffer the whole file, something like this is what you really want:

... | awk -v n=2 'NR<=n{print;next} {buf[((NR-1)%n)+1]=$0}
         END{for (i=1;i<=n;i++) print buf[((NR+i-1)%n)+1]}'

I think the math is correct on that. The idea is to use a rotating buffer indexed by NR modulo the size of the buffer, adjusted so that the indices run from 1 to n instead of 0 to n-1.

To help with comprehension of the modulus operator used in the indexing above, here is an example with intermediate print statements to show the logic as it executes:

$ cat file
1
2
3
4
5
6
7
8

$ cat tst.awk
BEGIN {
    print "Populating array by index ((NR-1)%n)+1:"
}
{
    buf[((NR-1)%n)+1] = $0

    printf "NR=%d, n=%d: ((NR-1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
        NR, n, NR-1, (NR-1)%n, ((NR-1)%n)+1, ((NR-1)%n)+1, buf[((NR-1)%n)+1]

}
END { 
    print "\nAccessing array by index ((NR+i-1)%n)+1:"
    for (i=1;i<=n;i++) {
        printf "NR=%d, i=%d, n=%d: (((NR+i = %d) - 1 = %d) %%n = %d) +1 = %d -> buf[%d] = %s\n",
            NR, i, n, NR+i, NR+i-1, (NR+i-1)%n, ((NR+i-1)%n)+1, ((NR+i-1)%n)+1, buf[((NR+i-1)%n)+1]
    }
}
$ 
$ awk -v n=3 -f tst.awk file
Populating array by index ((NR-1)%n)+1:
NR=1, n=3: ((NR-1 = 0) %n = 0) +1 = 1 -> buf[1] = 1
NR=2, n=3: ((NR-1 = 1) %n = 1) +1 = 2 -> buf[2] = 2
NR=3, n=3: ((NR-1 = 2) %n = 2) +1 = 3 -> buf[3] = 3
NR=4, n=3: ((NR-1 = 3) %n = 0) +1 = 1 -> buf[1] = 4
NR=5, n=3: ((NR-1 = 4) %n = 1) +1 = 2 -> buf[2] = 5
NR=6, n=3: ((NR-1 = 5) %n = 2) +1 = 3 -> buf[3] = 6
NR=7, n=3: ((NR-1 = 6) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, n=3: ((NR-1 = 7) %n = 1) +1 = 2 -> buf[2] = 8

Accessing array by index ((NR+i-1)%n)+1:
NR=8, i=1, n=3: (((NR+i = 9) - 1 = 8) %n = 2) +1 = 3 -> buf[3] = 6
NR=8, i=2, n=3: (((NR+i = 10) - 1 = 9) %n = 0) +1 = 1 -> buf[1] = 7
NR=8, i=3, n=3: (((NR+i = 11) - 1 = 10) %n = 1) +1 = 2 -> buf[2] = 8
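
One case the rotating-buffer one-liner above does not cover is input shorter than 2n lines: its END loop then reads buffer slots that were never filled and prints blank lines. Below is a sketch of a variant that only prints lines that were actually stored, assuming n is at least 1. The wrapper name first_last_awk and the 0-based slot numbering are choices made here, not part of the original answer.

```shell
# first_last_awk N: print the first N and last N lines of stdin, each line at
# most once. Hypothetical wrapper around a rotating buffer of N slots.
first_last_awk() {
    awk -v n="$1" '
        NR <= n { print; next }          # head: print immediately
        { buf[(NR - 1) % n] = $0 }       # tail candidates: overwrite oldest slot
        END {
            start = NR - n + 1           # first line number of the tail
            if (start <= n) start = n + 1    # skip lines already printed as head
            for (i = start; i <= NR; i++) print buf[(i - 1) % n]
        }'
}

seq 1 8 | first_last_awk 3
```

With the 8-line file used above and n=3 this prints 1, 2, 3, 6, 7, 8. With only 3 lines of input and n=2 it prints 1, 2, 3 with no blank line.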

This might work for you (GNU sed):

sed -n ':a;N;s/[^\n]*/&/2;Ta;2p;$p;D' file

This keeps a window of 2 lines (replace each 2 with n), prints the first 2 lines, and at end of file prints the window, i.e. the last 2 lines.

awk -v n=4 'NR<=n; {b = b "\n" $0} NR>=n {sub(/[^\n]*\n/,"",b)} END {print b}'

The first n lines are covered by the NR<=n; pattern. For the last n lines, we just keep track of a buffer holding the latest n lines, repeatedly adding one to the end and removing one from the front (after the first n).

It's possible to do it more efficiently, with an array of lines instead of a single buffer, but even with gigabytes of input, you'd probably waste more in brain time writing it out than you'd save in computer time by running it.

ETA: Because the above timing estimate provoked some discussion in (now deleted) comments, I'll add anecdata from having tried that out.

With a huge file (100M lines, 3.9 GiB, n=5) it took 454 seconds, compared to @EdMorton's rotating-buffer solution, which ran in only 30 seconds. With more modest inputs ("mere" millions of lines) the ratio is similar: 4.7 seconds vs. 0.53 seconds.

Almost all of that additional time in this solution seems to be spent in the sub() function; a tiny fraction also does come from string concatenation being slower than just replacing an array member.
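
To try the comparison on your own machine, the two approaches can be wrapped in shell functions and run against the same generated input. This is only a rough sketch: the function names string_buf and array_buf are invented here, the array version is the rotating-buffer one-liner from earlier on this page, and timings will vary with the awk implementation and the hardware.

```shell
# string_buf N: single string buffer, trimmed from the front with sub().
string_buf() {
    awk -v n="$1" 'NR<=n; {b = b "\n" $0} NR>=n {sub(/[^\n]*\n/,"",b)} END {print b}'
}

# array_buf N: rotating array of N slots (assumes at least 2N input lines).
array_buf() {
    awk -v n="$1" 'NR<=n{print;next} {buf[((NR-1)%n)+1]=$0}
        END{for (i=1;i<=n;i++) print buf[((NR+i-1)%n)+1]}'
}

# Both should print the same ten lines for this input.
seq 1 100000 | string_buf 5
seq 1 100000 | array_buf 5
```

In bash, prefixing each pipeline with the time keyword and sending the output to /dev/null gives figures to compare; increase the seq upper bound until the difference becomes visible.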

Here's a GNU sed one-liner that prints the first 10 and last 10 lines:

gsed -ne'1,10{p;b};:a;$p;N;21,$D;ba'

If you want to print a '--' separator between them:

gsed -ne'1,9{p;b};10{x;s/$/--/;x;G;p;b};:a;$p;N;21,$D;ba'

If you're on a Mac and don't have GNU sed, you can't condense as much:

sed -ne'1,9{' -e'p;b' -e'}' -e'10{' -e'x;s/$/--/;x;G;p;b' -e'}' -e':a' -e'$p;N;21,$D;ba'

Explanation

gsed -ne' invokes sed without automatically printing the pattern space

-e'1,9{p;b}' prints the first 9 lines

-e'10{x;s/$/--/;x;G;p;b}' prints line 10 with an appended '--' separator

-e':a;$p;N;21,$D;ba' keeps a sliding window of the most recent 10 lines and prints it once the last line has been read
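
The hard-coded 10 and 21 generalize to n and 2n+1. Here is a sketch of a wrapper under that assumption, written with separate -e fragments in the style of the Mac version so that it should not depend on GNU extensions. The function name sed_first_last is hypothetical, and this version omits the '--' separator.

```shell
# sed_first_last N: print the first N and last N lines of stdin using sed.
# Hypothetical wrapper; N must be a positive integer.
sed_first_last() {
    n=$1
    sed -n \
        -e "1,${n}{" -e 'p;b' -e '}' \
        -e ':a' \
        -e "\$p;N;$((2 * n + 1)),\$D;ba"
}

seq 1 100 | sed_first_last 3
```

The first group prints lines 1 to n and ends the cycle. After that, N keeps appending lines, and from line 2n+1 onward D drops the oldest one, so the pattern space holds n lines when $p fires on the last line. For the 100-line input above this prints 1, 2, 3, 98, 99, 100.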

Use GNU parallel. To print the first three lines and the last three lines (add -k if the output must stay in head-then-tail order):

parallel {} -n 3 file ::: head tail

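Based on dcaswell's answer, the following sed script prints the first 10 lines of a file, a separator, and then the trailing window. Note that the window holds 11 lines at the moment the last line is read, which is why the output below ends with 90 through 100: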

# Make a test file first
testit=$(mktemp)
seq 1 100 > "$testit"
# This sed script:
sed -n ':a;1,10h;N;${x;p;i\
-----
;x;p};11,$D;ba' "$testit"
rm "$testit"

Yields this:

1
2
3
4
5
6
7
8
9
10
-----
90
91
92
93
94
95
96
97
98
99
100

If you are using a shell that supports process substitution, another way to accomplish this is to write to multiple processes, one for head and one for tail. Suppose for this example your input comes from a pipe feeding you content of unknown length. You want to use just the first 5 lines and the last 10 lines and pass them on to another pipe:

cat | { tee >(head -5) >(tail -10) 1>/dev/null; } | cat

The use of {} collects the output from inside the group (two different programs will be writing to stdout from inside the process substitutions). The 1>/dev/null gets rid of the extra copy that tee would otherwise write to its own stdout.

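That demonstrates the concept and all the moving parts, but it can be simplified a little in practice by using the STDOUT stream of tee instead of discarding it. Note that the command grouping is still necessary here to pass the output on through the next pipe! Be aware that in bash the head process then inherits the pipe into tail as its stdout, so its 5 lines travel through tail as well (presumably the reason for tail -15 rather than tail -10); this makes the variant fragile on long inputs.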

cat | { tee >(head -5) | tail -15; } | cat

Obviously, replace cat in the pipeline with whatever you are actually doing. If your command can write the same content to multiple files, you can eliminate tee entirely, along with the monkeying with STDOUT. Say you have a command that accepts multiple -o output file name flags:

{ mycommand -o >(head -5) -o >(tail -10); } | cat

Here is another AWK script, which allows for the head and tail overlapping.

File script.awk

BEGIN {range = 3} # Define the head and tail range
NR <= range {print} # Output the head; for the first lines in range
{ arr[NR % range] = $0} # Store the current line in a rotating array
END { # Last line reached
    for (row = NR - range + 1; row <= NR; row++) { # Reread the last range lines from array
        print arr[row % range];
    }
}

Running the script

seq 1 7 | awk -f script.awk

Output

1
2
3
5
6
7

For overlapping head and tail:

seq 1 5 | awk -f script.awk

Output

1
2
3
3
4
5
