
Grep lines from a file in batches according to a format

I have a file with contents as:

Hi
welcome
! Chunk Start
Line 1
Line2
! Chunk Start
Line 1
Line 2
Line 3
! Chunk Start
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 1
Line 2
Line 3
Line 4

Now, everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk, i.e. the lines between two "! Chunk Start" markers make a chunk. I need to get the contents of each chunk in a single line, i.e.:

Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

I have done this, but I think there should be a better way. The way I have done this is:

grep -A100 "! Chunk Start" file.txt

The rest of the logic is there to concatenate the lines, but this is what I am worried about: if there are more than 100 lines in a chunk, this will fail. I probably need to do this with awk/sed. Please suggest.

You can use GNU AWK (gawk). It has a GNU extension for a powerful regexp form of the record separator RS to divide the input by ! Chunk Start. Each line of your "chunks" can then be processed as a field. Standard AWK has a limit on the number of fields (99 or something?), but gawk supports up to MAX_LONG fields. This large number of fields should solve your worry about 100+ input lines per chunk.

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt

AWK (and GNU AWK) works by dividing input into records, then dividing each record into fields. Here, we are dividing records (record separator RS) based on the string ! Chunk Start and then dividing each record into fields (field separator FS) based on a newline \n. You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case what we want happens to be the defaults (ORS="\n" and OFS=" ").
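
Purely as an illustration (not part of the answer), you can make the output separators visible by overriding one of them, e.g. printing a blank line after each chunk with a custom ORS:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n";ORS="\n\n"}NR>1{$1=$1;print}' infile.txt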

When dividing into records, the part before the first ! Chunk Start will be considered a record. We ignore this using NR>1. I have interpreted your problem specification

everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk

to mean that once ! Chunk Start has been seen, everything else until the end of input belongs to some chunk.
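
For example, dropping the NR>1 guard also prints that first record, so the preamble before the first marker ("Hi welcome") would appear as an extra first line:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}{$1=$1;print}' infile.txt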

The mysterious $1=$1 forces gawk to rebuild the record $0: reading the record already split it into fields using the input format (FS), consuming the newlines, and reassigning a field rejoins the fields with the output format (OFS). The print then prints this rebuilt record followed by ORS.
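
To see the difference, omitting the $1=$1 leaves $0 untouched, so each chunk would still come out with its internal newlines:

$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{print}' infile.txt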

Edit: The version above prints spaces at the end of each line. Thanks to @EdMorton for pointing out that the default field separator FS separates on whitespace (including newlines), so FS should be left unmodified:

$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
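
On the sample input this should print the chunks without the trailing spaces:

Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4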

This might work for you (GNU sed):

sed '0,/^! Chunk Start/d;:a;$!N;/! Chunk Start/!s/\n/ /;ta;P;d' file

Delete up to and including the first line containing ! Chunk Start. Gather up lines, replacing each newline with a space. When the next match is found, print the first line, delete the pattern space and repeat.
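
The same script spread over several lines with comments, in case the one-liner is hard to follow (still GNU sed, same behaviour):

sed '
  # delete up to and including the first line containing "! Chunk Start"
  0,/^! Chunk Start/d
  # gather loop: append the next line unless we are on the last line
  :a
  $!N
  # no new marker appended yet: replace the newline with a space and loop
  /! Chunk Start/!s/\n/ /
  ta
  # next marker (or end of input) reached: print up to the first newline,
  # then delete the pattern space and start the next cycle
  P
  d
' file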

Good grief. Just use awk:

$ awk -v RS='! Chunk Start' '{$1=$1}NR>1' file
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4

The above uses GNU awk for multi-char RS.
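
If you do not have GNU awk, a portable sketch (plain POSIX awk, accumulating the lines of each chunk by hand; the variable names are just for illustration) would be something like:

awk '
  /^! Chunk Start/ {                # a marker line starts a new chunk
      if (started) print chunk      # flush the previous chunk, if any
      chunk = ""; started = 1
      next
  }
  started {                         # inside a chunk: append this line
      chunk = (chunk == "" ? $0 : chunk " " $0)
  }
  END { if (started) print chunk }  # flush the final chunk
' file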
