How to loop through a string for patterns from the Linux shell?

Question

I have a script that looks through files in a directory for strings like :tagName: which works fine for single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.

My current script does:

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'

The first line is generating an output like this:

:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

The objective is to turn that into a list of single :tag: entries.

I want :tagOne:tagTwo:etcTag: to be turned into:

:tagOne:
:tagTwo:
:etcTag:

and so forth with :politics:violence: etc.

The colons aren't necessary; tagOne is just as good as :tagOne: (maybe better, but this is trivial).

The problem is that if a line has multiple tags, the line does not appear in the output at all (rather than merely showing only its first tag). Obviously the sed stage of the pipeline is the problematic part.

So I should replace the sed with something better...

I've tried :

A smarter sed:

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sort -u

...which works (for a limited number of tags) except that it produces weird results like:

:toxicity:p:
:somewhat:y:
:people:n:

...placing stray letters at the end of some tags: the p in :toxicity:p: is the final character of the :leadership: tag, and "leadership" no longer appears in the list. The same goes for :y: and :n: .
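A likely explanation, assuming the output is viewed in a terminal: \r is a carriage return rather than a newline, so every tag after the first is drawn over the previous one on the same line, and the tail of a longer earlier tag (the p: of :leadership:) stays visible. Below is a minimal sketch of the same substitution with \n instead. It assumes GNU sed (which accepts \n in the replacement) and swaps the letter class for [A-Za-z]; split_once is a hypothetical helper name.

```shell
# Hypothetical one-line sample standing in for the grep output.
line=':truama:leadership:business:toxicity:'

# Same substitution as above, but inserting a newline (\n) instead of a
# carriage return (\r). Assumes GNU sed for \n in the replacement text.
split_once() {
  sed -E 's|(:[A-Za-z]*:)([A-Za-z]*:)|\1\n:\2|g'
}

# Three passes are enough for this four-tag line.
printf '%s\n' "$line" | split_once | split_once | split_once
```

A fixed number of passes still only handles a limited number of tags per line, so this shows the cause of the stray letters rather than a general fix.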

I've also tried using loops in a couple ways...

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sort -u | grep lead

...which has the same problem of :leadership: tags being lost etc. And like...

for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
  for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
    printf "$t\n";
  done
done | sort -u

...which doesn't separate the tags at all, just prints stuff like: :truama:leadership:business:toxicity

Should I be taking some other approach? Using a different utility (perhaps cut inside a loop)? Maybe doing this in Python (I have a few Python scripts but don't know the language well, though maybe this would be easy to do that way)? Every time I see awk I think "EEK," so I'd prefer a non-awk solution, please; I'd like to stick to paradigms I've already used in order to learn them better.

Answer 1

Using PCRE in grep (where available) and positive lookbehind :

$ echo :tagOne:tagTwo:tagThree: |  grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:

You will lose the leading : but get the tags nevertheless.
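Where grep lacks -P, a similar split should be possible with plain -o and an extended regular expression. This is only a sketch, assuming a grep that supports -o (GNU and BSD grep do); list_tags is a hypothetical helper name, and both colons are dropped from each tag.

```shell
# Print every run of non-colon characters on its own line, then sort and
# de-duplicate. Reads tag lines such as :tagOne:tagTwo: on stdin.
list_tags() {
  grep -oE '[^:]+' | sort -u
}

printf '%s\n' ':positivity:' ':positivity:somewhat:' | list_tags
```

In the original script this would replace the sed and sort stages after the first grep.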

Edit : Did someone mention awk?:

$ awk '{
    while(match($0,/:[^:]+:/)) {
        a[substr($0,RSTART,RLENGTH)]
        $0=substr($0,RSTART+1)
    }
}
END {
    for(i in a)
        print i
}' file

Answer 2

Another idea using awk ...

Sample data generated by OP's initial grep :

$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

One awk idea:

awk '
    { split($0,tmp,":")                     # split input on colon;
                                            # NOTE: fields #1 and #NF are the empty string - see END block
      for ( x in tmp )                      # loop through tmp[] indices
          { arr[tmp[x]] }                   # store tmp[] values as  arr[] indices; this eliminates duplicates
    }
END { delete arr[""]                        # remove the empty string from arr[]
      for ( i in arr )                      # loop through arr[] indices
          { printf ":%s:\n", i }            # print each tag on a separate line with leading/trailing colons
    }
' tags.raw | sort                           # sort final output

NOTE : I'm not up to speed on awk's ability to internally sort arrays (thus eliminating the external sort call) so open to suggestions (or someone can copy this answer to a new one and update with said ability?)

The above also generates:

:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
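On the note about sorting inside awk: GNU awk can control the traversal order of a for-in loop through PROCINFO["sorted_in"]. The sketch below assumes gawk 4.0 or later for that feature and falls back to an external sort when gawk is not installed; unique_tags is a hypothetical helper name.

```shell
# Print each unique tag once, in ascending order, from colon-separated
# tag lines on stdin.
unique_tags() {
  if command -v gawk >/dev/null 2>&1; then
    gawk -F: '
      { for (i = 1; i <= NF; i++) if ($i != "") seen[$i] }   # indices de-duplicate
      END {
        PROCINFO["sorted_in"] = "@ind_str_asc"   # gawk-only: visit indices in string order
        for (tag in seen) printf ":%s:\n", tag
      }'
  else
    # Other awks: print first occurrences, then sort externally.
    awk -F: '{ for (i = 1; i <= NF; i++) if ($i != "" && !seen[$i]++) print ":" $i ":" }' |
      LC_ALL=C sort
  fi
}

printf '%s\n' ':therapy:babylon:' ':tech:' ':therapy:' | unique_tags
```

The fallback uses LC_ALL=C so that both branches order mixed-case tags the same way.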

Answer 3

Sample data generated by OP's initial grep :

$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

One while/for/printf idea based on associative arrays:

unset arr
typeset -A arr                          # declare array named 'arr' as associative

while read -r line                      # for each line from tags.raw ...
do
    for word in ${line//:/ }            # replace ":" with space and process each 'word' separately
    do
        arr[${word}]=1                  # create/overwrite arr[$word] with value 1;
                                        # objective is to make sure we have a single entry in arr[] for $word;
                                        # this eliminates duplicates
    done
done < tags.raw

printf ":%s:\n" "${!arr[@]}" | sort     # pass array indices (ie, our unique list of words) to printf;
                                        # per OP's desired output we'll bracket each word with a pair of ':';
                                        # then sort

Per OP's comment/question about removing the array, here is a twist on the above where we eliminate the array in favor of printing from the inner loop and then piping everything to sort -u :

while read -r line                      # for each line from tags.raw ...
do
    for word in ${line//:/ }            # replace ":" with space and process each 'word' separately
    do
        printf ":%s:\n" "${word}"       # print ${word} to stdout
    done
done < tags.raw | sort -u               # pipe all output (ie, the list of ${word}s) to sort -u for sorting and removing dups

Both of the above generate:

:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
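One caveat with the unquoted ${line//:/ } expansion: the shell also applies pathname expansion to each resulting word, so a tag containing * or ? could expand to file names. A hedged sketch of the same loop that splits on IFS=: with globbing switched off follows; it sticks to POSIX sh features, and print_tags is a hypothetical name.

```shell
# Read colon-separated tag lines on stdin and print each unique tag as :tag:.
print_tags() {
  (
    set -f                        # disable pathname expansion in this subshell
    while IFS= read -r line; do
      IFS=:                       # split the unquoted $line on colons
      for word in $line; do
        if [ -n "$word" ]; then   # skip the empty field before the leading colon
          printf ':%s:\n' "$word"
        fi
      done
    done
  ) | sort -u
}

printf '%s\n' ':therapy:babylon:' ':therapy:' | print_tags
```

The subshell keeps the changed IFS and the set -f from leaking into the calling shell.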

Answer 4

A pipe through tr can split those strings out to separate lines:

grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'

This will also remove the colons and an empty line will be present in the output (easy to repair, note the empty line will always be the first one due to the leading : ). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.
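The repair and de-duplication described above could be put together roughly as follows. This is a sketch that reads the tag lines on stdin instead of grepping the wiki files; split_tags is a hypothetical helper name.

```shell
# Turn colon-separated tag lines into one tag per line, drop the single
# empty line caused by the leading colon, then sort and de-duplicate.
split_tags() {
  tr -s ':' '\n' | grep -v '^$' | sort -u
}

printf '%s\n' ':politics:violence:' ':positivity:' | split_tags
```

With the original files, the grep -hx command above would be piped into split_tags in place of the bare tr call.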

An approach with sed :

sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd

This also removes colons, but avoids adding empty lines (by removing the leading/trailing : with s before using y to transliterate remaining : to <newline> ). sed could be combined with tr:

sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'

Using awk to work with the : separated fields, removing duplicates:

awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd

4 answers

solution1
5 2020-11-28 18:16:07

solution2
3 2020-11-28 18:55:07

solution3
2 2020-11-28 18:40:02

solution4
2 ACCEPTED 2020-11-29 03:02:13
