
Parsing simple string with awk or sed in linux

original string:
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/

The depth of the directories will vary, but the /trunk part will always remain the same. The single character in front of /trunk identifies that line.

desired output:

A /trunk/apple
B /trunk/apple
Z /trunk/orange
Q /trunk/melon/juice/venti/straw

*** edit
I'm sorry, I made a mistake by adding a slash at the end of each path in the original string, which made the output confusing. The original string didn't have the slash in front of the capital letter, but I'll leave it be.

To handle more complex input, where a single line may contain any number of / separated values after trunk, please try the following.

awk '
{
  gsub(/[^/]*\/trunk/,OFS"&")
  sub(/^ /,"")
  sub(/\//,OFS"&")
  gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")
  sub(/\n/,OFS)
  gsub(/\n /,ORS)
  gsub(/\/trunk/,OFS"&")
  sub(/[[:space:]]+/,OFS)
}
1
'  Input_file


With your shown samples, please try the following awk code.

awk '{gsub(/\/trunk/,OFS "&");gsub(/trunk\/[^/]*\//,"&\n")} 1' Input_file

With GNU awk for multi-char RS and RT:

$ awk -v RS='([^/]+/){2}[^/\n]+' 'RT{sub("/",OFS,RT); print RT}' file
A trunk/apple
B trunk/apple
Z trunk/orange

I'm setting RS to a regexp describing each string you want to match, i.e. 2 repetitions of non-/ characters followed by /, and then a final string of non-/ characters (and non-newline for the last string on the input line). RT is automatically set to each of the matching strings, so then I just change the first / to a blank and print the result.
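Where GNU awk is not available, the same idea can be approximated in POSIX awk by matching each three-level chunk in a loop with match(). This is only a sketch under that assumption: the function name split_fixed_depth is made up for illustration, and the {2} repetition is spelled out because older awk implementations may not support interval expressions.

```shell
# Hypothetical helper: split_fixed_depth is an illustrative name, not part of the answer.
split_fixed_depth() {
  awk '{
    s = $0
    # Repeatedly find the next "x/y/z" chunk: three runs of non-slash characters.
    while (match(s, "[^/]+/[^/]+/[^/]+")) {
      chunk = substr(s, RSTART, RLENGTH)
      sub("/", " ", chunk)              # change only the first / to a blank
      print chunk
      s = substr(s, RSTART + RLENGTH)   # continue after the matched chunk
    }
  }'
}

printf '%s\n' 'A/trunk/apple/B/trunk/apple/Z/trunk/orange' | split_fixed_depth
```

Like the RS/RT version, this assumes every path is exactly three levels deep.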

If each path isn't always 3 levels deep but does always start with something/trunk/, e.g.:

$ cat file
A/trunk/apple/banana/B/trunk/apple/Z/trunk/orange

then:

$ awk -v RS='[^/]+/trunk/' 'RT{if (NR>1) print pfx $0; pfx=gensub("/"," ",1,RT)} END{printf "%s%s", pfx, $0}' file
A trunk/apple/banana/
B trunk/apple/
Z trunk/orange

In awk you can try this solution. It deals with the special requirement of removing forward slashes when the next character is upper case. Will not win a design award but works.

$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" |
    awk -F '' '{ x=""; for(i=1;i<=NF;i++){
      if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
      if($i~/[A-Z]/){ printf "%s%s ", x, $i }
      else{ x="\n"; printf "%s", $i } }; print "" }'
A /trunk/apple
B /trunk/apple
Z /trunk/orange

Also works for n words. Actually works with anything that follows the given pattern.

$ echo "A/fruits/apple/mango/B/anything/apple/pear/banana/Z/ball/orange/anything" |
    awk -F '' '{ x=""; for(i=1;i<=NF;i++){
      if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
      if($i~/[A-Z]/){ printf "%s%s ", x, $i }
      else{ x="\n"; printf "%s", $i } }; print "" }'
A /fruits/apple/mango
B /anything/apple/pear/banana
Z /ball/orange/anything

This might work for you (GNU sed):

sed 's/[^/]*/& /;s/\//\n/3;P;D' file

Separate the first word from the first / by a space.

Replace the third / by a newline.

Print/delete the first line and repeat.
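The three steps above can be traced with the same command written out one instruction per line. This is a sketch that assumes GNU sed (for \n in the replacement); split_three_levels is a hypothetical wrapper name.

```shell
# Hypothetical wrapper: split_three_levels is an illustrative name. Assumes GNU sed.
split_three_levels() {
  sed '
    # Step 1: put a space after the leading word (everything before the first /).
    s/[^/]*/& /
    # Step 2: turn the third / into a newline, ending the current record.
    s/\//\n/3
    # Step 3: P prints up to the first newline; D deletes that part and
    # restarts the script on what is left without reading new input.
    P
    D
  '
}

printf '%s\n' 'A/trunk/apple/B/trunk/apple/Z/trunk/orange' | split_three_levels
```

On the last record there is no third /, so nothing is replaced, P prints the whole pattern space, and D ends the cycle.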


If the first word has the property that it is only one character long:

sed 's/./& /;s#/\(./\)#\n\1#;P;D' file

Or if the first word has the property that it begins with an upper case character:

sed 's/[[:upper:]][^/]*/& /;s#/\([[:upper:]][^/]*/\)#\n\1#;P;D' file

Or if the first word has the property that it is followed by /trunk/ :

sed -E 's#([^/]*)(/trunk/)#\n\1 \2#g;s/.//' file

Using gnu awk you could use FPAT to set contents of each field using a pattern.

When looping over the fields, replace the first / with a space followed by / (" /").

str1="A/trunk/apple/B/trunk/apple/Z/trunk/orange"

echo "$str1" | awk -v FPAT='[^/]+/trunk/[^/]+' '{
for(i=1;i<=NF;i++) {
    sub("/", " /", $i)
    print $i
    }
}'

The pattern matches

  • [^/]+ Match one or more chars other than /
  • /trunk/[^/]+ Match /trunk/ followed by one or more chars other than /
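To see what the pattern picks out without gawk, the same regex can be fed to grep -o, which prints each match on its own line. This is a rough sketch, not an equivalent of FPAT in every respect; show_fpat_matches is a hypothetical name, and -o is assumed to be available (GNU and BSD grep provide it).

```shell
# Hypothetical helper: show_fpat_matches is an illustrative name.
# Assumes a grep that supports -o (print only the matching parts).
show_fpat_matches() {
  # Each match is "<name>/trunk/<one path component>"; sed then puts a
  # space in front of the first / of every match.
  grep -oE '[^/]+/trunk/[^/]+' | sed 's|/| /|'
}

printf '%s\n' 'A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus' | show_fpat_matches
```

Because the final [^/]+ stops at the next /, deeper components such as citrus are not part of any match.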

Output

A /trunk/apple
B /trunk/apple
Z /trunk/orange

With GNU sed:

$ str="A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/"
$ sed -E 's|/?(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"

A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Note the first empty output line. If it is undesirable we can separate the processing of the first output line:

$ sed -E 's|(.)|\1 |;s|/(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

With awk using gsub() and sub() functions:

awk '
{
gsub(/[[:upper:]]{1}/,"& ")
sub(/[[:upper:]]{1}$/,"\n&",$2)
sub(/[[:upper:]]{1}$/,"\n&",$3)
$1=$1
gsub(/[/]\n/,"\n")
} 1' file
A /trunk/apple
B /trunk/apple
Z /trunk/orange
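
  • the first gsub() is applied to $0 by default and puts a space after every upper case letter.
  • then we use the same regexp in sub() on the $2 and $3 fields to put a newline before the trailing upper case letter.
  • rebuild the record: $1=$1 .
  • finally, we remove the / left at the end of each line.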

Assuming your data will always be in the format provided as a single string, you can try this sed command.

$ sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g' input_file
$ echo "A/trunk/apple/pine/skunk/B/trunk/runk/bunk/apple/Z/trunk/orange/T/fruits/apple/mango/P/anything/apple/pear/banana/L/ball/orange/anything/S/fruits/apple/mango/B/rupert/cream/travel/scout/H/tall/mountains/pottery/barnes" | sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g'
A /trunk/apple/pine/skunk
B /trunk/runk/bunk/apple
Z /trunk/orange
T /fruits/apple/mango
P /anything/apple/pear/banana
L /ball/orange/anything
S /fruits/apple/mango
B /rupert/cream/travel/scout
H /tall/mountains/pottery/barnes

Some fun with perl, where you can use a non-consuming regex to autosplit into the @F array, then just print however you want.

perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'

Step 1: Split

  • perl -lanF'/(?=.{1,2}trunk)/'
  • This will take the input stream, and split each line whenever the pattern .{1,2}trunk is encountered
  • Because we want to retain trunk and the preceding 1 or 2 chars, we wrap the split pattern in (?=) to make it a non-consuming lookahead
  • This splits things up this way:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print join " ", @F'
A /trunk/apple/ B /trunk/apple/ Z /trunk/orange/citrus/ Q /trunk/melon/juice/venti/straw/

Step 2: Format output:

  • The @F array contains pairs that we want to print in order, so we'll iterate half of the array indices, and print 2 at a time:
  • print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2 --> Double the iterator, and print pairs
  • using perl -l means each print has an implicit \n at the end
  • The results:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
A /trunk/apple/
B /trunk/apple/
Z /trunk/orange/citrus/
Q /trunk/melon/juice/venti/straw/

Endnote: Perl obfuscation that didn't work.

  • Any array in perl can be assigned to a hash, which reads it as (key,val,key,val....)
  • So %F=@F; print "$_ $F{$_}" for keys %F seems like it would be really slick
  • But you lose order:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e '%F=@F; print "$_ $F{$_}" for keys %F'
Z /trunk/orange/citrus/
A /trunk/apple/
Q /trunk/melon/juice/venti/straw/
B /trunk/apple/

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 