Specific data formatting using awk or sed

Question

I am working with large datasets of file information, formatted into blocks. I want to take a piece of data from each block's "File path" line and append it as a new column to certain lines in that block. The data looks like this:

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
b9:45:3d:f4:97:88               1849                    10
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10
19:c5:b2:aa:b3:60               613                     10
11:7c:7e:76:4b:d5               1272                    10
36:e0:59:49:b6:4a               581                     10
9c:31:bc:8a:39:94               3296                    10
01:f0:56:3a:e1:a9               1140                    10
Whole File Hash: 4b28b44ae03d

I want to take the file extension (.jar and .c in this example) and append it to each line under the respective Chunk Hash header, so that the final output looks like this:

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

I already have awk code to extract the file extension and to select the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'

I am just not sure how to connect these pieces using awk (or sed) so that the file extension is appended as a new column to each line of its data block. I am doing this in a bash script, if that makes a difference.

Answer 1

Solution in TXR language:

@(repeat)
@  (cases)
File path: @*path.@suff
Inode Num: @inode
@header
@    (collect)
@hashline
@    (last)
Whole File Hash: @wfh
@    (end)
@    (output)
File path: @path.@suff
Inode Num: @inode
@header
@      (repeat)
@{hashline 88}.@suff
@      (end)
Whole File Hash: @wfh
@    (end)
@  (or)
@other
@  (do (put-line other))
@  (end)
@(end)

Run:

$ txr suffixes.txr data
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

Answer 2

Here is a (GNU) sed solution:

/File path:/ {         # If line matches "File path:"
    h                  # Copy pattern space to hold space
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space
    x                  # Swap pattern space and hold space
}                      # Hold space now contains extension
/Chunk Hash/ {         # If line matches "Chunk Hash"
    n                  # Get next line into pattern space
    :loop              # Anchor for loop
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop
    G                  # Append extension from hold space to pattern space
    s/\n/\t\t\t\t/     # Substitute newline with a bunch of tabs
    n                  # Get next line
    b loop             # Jump back to ":loop" label
}

This can be stored in a separate file (say, so.sed) and called like this:

sed -r -f so.sed infile

resulting in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

Non-GNU seds have to jump through the usual hoops to insert tabs and can't use the -r option (but probably -E, which should be equivalent here; -r was used only to avoid having to escape the parentheses).
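
As a rough sketch of those hoops, the same script can be embedded in a shell function, with the tab character produced by printf and interpolated into the replacement. This assumes a sed that accepts -E (GNU and BSD seds do); the function name append_ext_sed and the tab variable are made up for illustration, and trailing comments are left out of the sed script because not every sed accepts them after commands.

```shell
# Hypothetical wrapper around the sed script above; assumes sed supports -E.
append_ext_sed() {
    tab=$(printf '\t')   # a literal tab, since \t in the replacement is a GNU extension
    # First fragment: stash the extension in the hold space.
    # Second fragment: append it to every chunk line of the block.
    sed -E -e '/File path:/{
h
s/.*(\.[^.]*)$/\1/
x
}' -e "/Chunk Hash/{
n
:loop
/Whole File Hash/b
G
s/\n/${tab}${tab}${tab}${tab}/
n
b loop
}" "$@"
}
```

It would be called as append_ext_sed infile, or at the end of a pipeline with no file argument.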

Answer 3

In awk:

$ cat script.awk
/File path/ { 
    match($0,/\..+/)
    ext=substr($0,RSTART,RLENGTH)
} 
/Chunk Hash/ {
    flag=1            # flag on
    print             # print here to...
    next              # avoid printing ext
} 
/Whole File Hash:/ {  
    flag=0            # flag off
} 
flag==1 {
    print $0, ext     # add space here to your liking, left it short...
    next              # ... to show output on screen without sidescrolling
} 1                   # print non-flagged records

Run:

$ awk -f script.awk data.txt
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10 .jar
7a:8b:8e:20:7b:38               1982                    10 .jar
b9:45:3d:f4:97:88               1849                    10 .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10 .c
19:c5:b2:aa:b3:60               613                     10 .c
11:7c:7e:76:4b:d5               1272                    10 .c
36:e0:59:49:b6:4a               581                     10 .c
9c:31:bc:8a:39:94               3296                    10 .c
01:f0:56:3a:e1:a9               1140                    10 .c
Whole File Hash: 4b28b44ae03d
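
If the aligned column from the question is wanted, one option is to pad each chunk line with printf instead of print. The sketch below wraps the script in a shell function so it can be dropped into a bash script; the name append_ext is hypothetical, and the width of 88 is an assumption read off the sample data (the column where the extension starts in the desired output). It also anchors the extension match to the last dot-suffix of the path, which is a small deviation from the regex above.

```shell
# append_ext is a hypothetical helper name; 88 is the column width inferred
# from the sample data, not a value required by awk.
append_ext() {
    awk '
    /^File path:/ {
        ext = ""
        if (match($0, "\\.[^./]*$"))            # last dot-suffix of the path
            ext = substr($0, RSTART, RLENGTH)
    }
    /^Chunk Hash/       { flag = 1; print; next }   # header row stays unchanged
    /^Whole File Hash:/ { flag = 0 }                # end of the chunk rows
    flag                { printf "%-88s%s\n", $0, ext; next }
    1                                               # print everything else as-is
    ' "$@"
}
```

Usage would be append_ext data.txt, or some_command | append_ext when reading from a pipe.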

Answer 4

In GNU awk (gensub is a gawk extension):

awk  --re-interval '
/^File/{                                 # If the line starts with "File"
    s=gensub("[^.]+(.*)","\\1","1",$0);  # Capture the extension (e.g. ".c", ".jar") in s
} 
/(..:){3,}/{                             # If the line has three or more consecutive "xx:" groups
    gsub("[0-9]+$","&\t\t\t\t\t" s,$0)   # Append tabs and s after the trailing number
}
1' file                                  # Print every line

4 answers

solution1
2 2016-10-04 01:41:43

solution2
2 ACCEPTED 2016-10-04 02:01:41

solution3
0 2016-10-04 04:54:35

solution4
0 2016-10-04 12:15:51
