I have a file with contents as:
Hi
welcome
! Chunk Start
Line 1
Line2
! Chunk Start
Line 1
Line 2
Line 3
! Chunk Start
Line 1
Line 2
Line 3
Line 1
Line 2
Line 3
Line 4
Line 5
Line 1
Line 2
Line 3
Line 4
Now, everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk, i.e. the lines between the "! Chunk Start" markers make a chunk. I need to get the contents of each chunk in a single line, i.e.:
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4
I have done this, but I think there should be a better way. The way I have done this is:
grep -A100 "! Chunk Start" file.txt
The rest of the logic concatenates the lines. But this is what I am worried about: if a chunk has more than 100 lines, this will fail. I probably need to do this with awk/sed. Please suggest.
You can use GNU AWK (gawk). It has a GNU extension that allows a powerful regexp form of the record separator RS, which we can use to divide the input by ! Chunk Start. Each line of your "chunks" can then be processed as a field. Traditional AWK implementations have a limit on the number of fields (historically 99), but gawk supports up to MAX_LONG fields. This large number of fields should solve your worry about 100+ input lines per chunk.
$ gawk 'BEGIN{RS="! Chunk Start\n";FS="\n"}NR>1{$1=$1;print}' infile.txt
AWK (and GNU AWK) works by dividing input into records, then dividing each record into fields. Here, we are dividing records (record separator RS) on the string ! Chunk Start, and then dividing each record into fields (field separator FS) on the newline \n. You can also specify a custom output record separator ORS and a custom output field separator OFS, but in this case what we want happens to be the defaults (ORS="\n" and OFS=" ").
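As a small illustration of custom output separators (a sketch with made-up input, not the question's data):

```shell
# Rebuild each record from its fields joined by OFS="-", and terminate
# each output record with ORS=";\n". ($1=$1 forces the rebuild, as
# explained further down in this answer.)
printf 'a b\nc d\n' | awk 'BEGIN{OFS="-"; ORS=";\n"} {$1=$1; print}'
```

This prints a-b; and c-d; on separate lines, showing that print uses OFS and ORS rather than whatever separators the input had.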
When dividing into records, the part before the first ! Chunk Start will be considered a record; we ignore it using NR>1. I have interpreted your problem specification
everything beginning with "! Chunk Start" and before the next "! Chunk Start" is a chunk
to mean that once ! Chunk Start has been seen, everything else until the end of input belongs to some chunk.
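The effect of NR>1 is easiest to see with a single-character separator (a sketch using a hypothetical marker #, which works in any POSIX awk, not just gawk):

```shell
# "pre" is everything before the first separator; it becomes record 1.
printf 'pre#a#b' | awk 'BEGIN{RS="#"} {print NR ": " $0}'
```

This prints 1: pre, 2: a, 3: b; adding NR>1 as a pattern drops that first record, leaving only a and b.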
The mysterious $1=$1 forces gawk to reprocess the record $0, parsing it using the input format (FS) and consuming the newlines. The print prints this reprocessed record using the output format (OFS and ORS).
Edit: The version above prints a trailing space at the end of each line. Thanks to @EdMorton for pointing out that the default field separator FS already separates on whitespace (including newlines), so FS should be left unmodified:
$ gawk 'BEGIN{RS="! Chunk Start\n"}NR>1{$1=$1;print}' infile.txt
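To see the $1=$1 rebuild in isolation (a sketch with made-up input): without it, print emits the record untouched; with it, the fields are re-joined with OFS.

```shell
# The record contains runs of spaces and a tab; the default FS splits
# on any whitespace, and $1=$1 re-joins the fields with OFS " ".
printf 'a  b\tc\n' | awk '{$1=$1; print}'
```

This prints a b c, whereas a plain print with no preceding assignment would reproduce the original whitespace verbatim.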
This might work for you (GNU sed):
sed '0,/^! Chunk Start/d;:a;$!N;/! Chunk Start/!s/\n/ /;ta;P;d' file
Delete up to and including the first line containing ! Chunk Start. Then gather up lines, replacing each newline with a space. When the next match is found, print up to the first newline, delete the pattern space and repeat.
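Spread over multiple lines with comments, the same script reads as follows (behavior unchanged; GNU sed syntax, with a shortened sample of the question's input piped in so the sketch is self-contained):

```shell
printf 'Hi\nwelcome\n! Chunk Start\nLine 1\nLine2\n! Chunk Start\nLine 1\nLine 2\nLine 3\n' |
sed '
  # delete everything up to and including the first marker line
  0,/^! Chunk Start/d
  # loop: append the next line; while no marker appears, join with a space
  :a
  $!N
  /! Chunk Start/!s/\n/ /
  ta
  # marker (or end of input) reached: print up to the first newline
  P
  # clear the pattern space and start the next cycle
  d
'
```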
Good grief. Just use awk:
$ awk -v RS='! Chunk Start' '{$1=$1}NR>1' file
Line 1 Line2
Line 1 Line 2 Line 3
Line 1 Line 2 Line 3 Line 1 Line 2 Line 3 Line 4 Line 5 Line 1 Line 2 Line 3 Line 4
The above uses GNU awk for multi-char RS.
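If GNU awk is not available, a portable sketch in plain POSIX awk does the same job by accumulating each chunk in a buffer and flushing it on the next marker (no multi-character RS needed; the question's sample input is piped in to make it self-contained):

```shell
# Build the question's sample input, then join each chunk with POSIX awk.
printf '%s\n' Hi welcome '! Chunk Start' 'Line 1' Line2 \
  '! Chunk Start' 'Line 1' 'Line 2' 'Line 3' \
  '! Chunk Start' 'Line 1' 'Line 2' 'Line 3' 'Line 1' 'Line 2' 'Line 3' \
  'Line 4' 'Line 5' 'Line 1' 'Line 2' 'Line 3' 'Line 4' |
awk '
  # on a marker line: flush the buffered chunk (if any) and reset
  /^! Chunk Start$/ { if (started) print line; line = ""; started = 1; next }
  # inside a chunk: append the line to the buffer, space-separated
  started { line = line (line == "" ? "" : " ") $0 }
  # end of input: flush the last chunk
  END { if (started) print line }
'
```

Lines before the first marker are ignored because started is still false, matching the NR>1 behavior of the gawk versions.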