Specific data formatting using awk or sed

I am working with large datasets of file information, formatted into blocks of data. I want to take a piece of data from each block's file path line and append it as a new column to certain lines in that block. The data is formatted like so:

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
b9:45:3d:f4:97:88               1849                    10
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10
19:c5:b2:aa:b3:60               613                     10
11:7c:7e:76:4b:d5               1272                    10
36:e0:59:49:b6:4a               581                     10
9c:31:bc:8a:39:94               3296                    10
01:f0:56:3a:e1:a9               1140                    10
Whole File Hash: 4b28b44ae03d

What I want to do is take the file type (.jar and .c in this example) and append it to each chunk hash line of the corresponding block, so the final formatting would look like:

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)       
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

I already have awk code to pull the file type, and awk code to pull the chunk hash lines:

awk 'match($0,/\..+/) {print substr($0,RSTART,RLENGTH)}'

awk '/Chunk Hash/{flag=1;next}/Whole File Hash:/{flag=0}flag'

I am just not sure how to combine these pieces using awk (or sed) to append the file type as a new column to each line of its data block. Note that I am trying to do this in a bash script, in case that makes a difference.

Solution in TXR language:

@(repeat)
@  (cases)
File path: @*path.@suff
Inode Num: @inode
@header
@    (collect)
@hashline
@    (last)
Whole File Hash: @wfh
@    (end)
@    (output)
File path: @path.@suff
Inode Num: @inode
@header
@      (repeat)
@{hashline 88}.@suff
@      (end)
Whole File Hash: @wfh
@    (end)
@  (or)
@other
@  (do (put-line other))
@  (end)
@(end)

Run:

$ txr suffixes.txr data
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

Here is a (GNU) sed solution:

/File path:/ {         # If line matches "File path:"
    h                  # Copy pattern space to hold space
    s/.*(\.[^.]*)$/\1/ # Remove everything but extension from pattern space
    x                  # Swap pattern space and hold space
}                      # Hold space now contains extension
/Chunk Hash/ {         # If line matches "Chunk Hash"
    n                  # Get next line into pattern space
    :loop              # Anchor for loop
    /Whole File Hash/b # If line matches "Whole File Hash", jump out of loop
    G                  # Append extension from hold space to pattern space
    s/\n/\t\t\t\t/     # Substitute newline with a bunch of tabs
    n                  # Get next line
    b loop             # Jump back to ":loop" label
}

This can be stored in a separate file (say, so.sed) and called like this:

sed -r -f so.sed infile

resulting in

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10                              .jar
7a:8b:8e:20:7b:38               1982                    10                              .jar
b9:45:3d:f4:97:88               1849                    10                              .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10                              .c
19:c5:b2:aa:b3:60               613                     10                              .c
11:7c:7e:76:4b:d5               1272                    10                              .c
36:e0:59:49:b6:4a               581                     10                              .c
9c:31:bc:8a:39:94               3296                    10                              .c
01:f0:56:3a:e1:a9               1140                    10                              .c
Whole File Hash: 4b28b44ae03d

Non-GNU seds have to jump through the usual hoops to insert tabs and can't use the -r option (though they probably support -E, which should be equivalent here; -r was only used to avoid having to escape the parentheses).
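
As a rough illustration of those hoops: one common workaround is to let the shell produce a literal tab with printf and splice it into the sed script, and to use -E instead of -r. The sketch below assumes a sed that accepts -E. The file path in the sample input is shortened and made up, and the variable names tab and result are arbitrary.

```shell
#!/bin/sh
# A literal tab character, because "\t" in a replacement is a GNU extension.
tab=$(printf '\t')

# Same logic as so.sed, inlined. The four tabs are spliced in from the shell
# variable by closing and reopening the single-quoted script.
# The input is a shortened sample block (hypothetical path) fed on stdin.
result=$(sed -E '
/File path:/ {
    h
    s/.*(\.[^.]*)$/\1/
    x
}
/Chunk Hash/ {
    n
    :loop
    /Whole File Hash/b
    G
    s/\n/'"$tab$tab$tab$tab"'/
    n
    b loop
}
' <<'EOF'
File path: /d9b50a6f54d5a1f8/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
Whole File Hash: 865999b40fd9
EOF
)

printf '%s\n' "$result"
```

The sed comments are left out on purpose, since trailing comments after a command are not accepted by every sed. With GNU sed the same script also works with -r in place of -E.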

In awk:

$ cat script.awk
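/File path/ {
    ext=""                          # reset in case this path has no extension
    if (match($0,/\.[^.\/]+$/))     # last ".xxx" at the end of the path only
        ext=substr($0,RSTART,RLENGTH)
}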
/Chunk Hash/ {
    flag=1            # flag on
    print             # print here to...
    next              # avoid printing ext
} 
/Whole File Hash:/ {  
    flag=0            # flag off
} 
flag==1 {
    print $0, ext     # add space here to your liking, left it short...
    next              # ... to show output on screen without sidescrolling
} 1                   # print non-flagged records

Run:

$ awk -f script.awk data.txt
File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/aab17eb15d782d7b/af38f2bcc4998af0/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10 .jar
7a:8b:8e:20:7b:38               1982                    10 .jar
b9:45:3d:f4:97:88               1849                    10 .jar
Whole File Hash: 865999b40fd9

File path: /d9b50a6f54d5a1f8/7b3d459a3454703c/a6d1040ea2c84e10/afcbe93ced71e5e6/2b517a561f5da8a6/1e82b13443330bb3/12fd3e87b2f62dc8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10 .c
19:c5:b2:aa:b3:60               613                     10 .c
11:7c:7e:76:4b:d5               1272                    10 .c
36:e0:59:49:b6:4a               581                     10 .c
9c:31:bc:8a:39:94               3296                    10 .c
01:f0:56:3a:e1:a9               1140                    10 .c
Whole File Hash: 4b28b44ae03d
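
If the extension should line up as a separate column, as in the desired output in the question, one option is printf with a fixed-width field instead of print. This is a minimal sketch, assuming a width of 88 (the column where the extension starts in the question's sample output). The file path in the sample input is shortened and made up, and out is an arbitrary variable name.

```shell
#!/bin/sh
# Pad each chunk line to 88 characters, then append the extension.
# 88 is an assumption taken from the sample output; adjust for other layouts.
out=$(awk '
/File path/ {
    ext = ""
    if (match($0, /\.[^.\/]+$/))               # last ".xxx" at the end of the path
        ext = substr($0, RSTART)
}
/Chunk Hash/       { flag = 1; print; next }   # header line: print unchanged
/Whole File Hash:/ { flag = 0 }                # end of the chunk lines
flag               { printf "%-88s%s\n", $0, ext; next }
1                                              # print everything else unchanged
' <<'EOF'
File path: /d9b50a6f54d5a1f8/6e1a9f0b0a281564.c
Inode Num: 31881221
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
e8:b0:cb:6f:76:ff               1344                    10
19:c5:b2:aa:b3:60               613                     10
Whole File Hash: 4b28b44ae03d
EOF
)

printf '%s\n' "$out"
```

Lines longer than 88 characters are not truncated by %-88s, so on such a line the extension would simply follow the text directly.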

Another option, using GNU awk (gensub is a gawk extension):

awk --re-interval '
/^File/ {                                  # If the line starts with "File"
    s = gensub("[^.]+(.*)", "\\1", 1, $0)  # Keep everything from the first "." on (e.g. ".c", ".jar") in s
}
/(..:){3,}/ {                              # If the line contains three or more "xx:" groups (a chunk hash line)
    sub("[0-9]+$", "&\t\t\t\t\t" s)        # Append tabs and s after the trailing number
}
1' file                                    # Print every line
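
gensub and --re-interval are specific to GNU awk. For other awks, roughly the same idea can be sketched with match and substr, which POSIX awk provides. The hash-line pattern below (three leading pairs of hex digits, each followed by a colon) is an assumption about the data. The sample input uses a shortened, made-up file path, and portable is an arbitrary variable name.

```shell
#!/bin/sh
# Same approach without gawk extensions: remember the extension on the
# "File path:" line, then append it to every line that starts like a chunk hash.
portable=$(awk '
/^File path:/ {
    ext = ""
    if (match($0, /\.[^.\/]+$/))       # last ".xxx" at the end of the path
        ext = substr($0, RSTART)
}
/^[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:[0-9a-f][0-9a-f]:/ {
    $0 = $0 "\t\t\t\t\t" ext           # five tabs, as in the gawk version
}
1                                      # print every line
' <<'EOF'
File path: /d9b50a6f54d5a1f8/0d8eb680024af333.jar
Inode Num: 22525898
Chunk Hash                      Chunk Size (bytes)      Compression Ratio (tenth)
45:97:2a:60:e3:69               3208                    10
7a:8b:8e:20:7b:38               1982                    10
Whole File Hash: 865999b40fd9
EOF
)

printf '%s\n' "$portable"
```

Unlike the gawk version, this takes only the last extension of the path, so a dot in a directory name would not be picked up.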
