How do I loop through strings matching a pattern in a Linux shell?

Question

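I have a script that looks through the files in a directory for strings like :tagName:. It works for a single :tag: but not for multiple :tagOne:tagTwo:tagThree: tags.

My current script is: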

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
sed -r 's|.*(:[Aa-Zz]*:)|\1|g' | \
sort -u
printf '\nNote: this fails to display combined :tagOne:tagTwo:etcTag:\n'

The first line generates output like this:

:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

The goal is to turn this into a list of single :tag: entries.

I want :tagOne:tagTwo:etcTag: to become:

:tagOne:
:tagTwo:
:etcTag:

and likewise for :politics:violence: and the rest.

The colons aren't required; tagOne is just as good as :tagOne: (maybe better, but that part is trivial).

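The problem is that if a line has more than one tag, that line doesn't appear in the output at all (as opposed to the reverse problem of only the line's first tag being shown). Clearly the | sed ... | step is at fault.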

So I should replace the sed with something better...

I've tried:

A smarter sed:

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sort -u

...which works (for a limited number of tags), except that it produces strange results such as:

:toxicity:p:
:somewhat:y:
:people:n:

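...putting weird random letters at the end of some tags, where :p: is the last character of the :leadership: tag, and "leadership" no longer appears in the list. The same goes for :y: and :n:.

I've also tried using loops in several ways...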

grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sed -r 's|(:[Aa-Zz]*:)([Aa-Zz]*:)|\1\r:\2|g' | \
  sort -u | grep lead

...with the same problem of the :leadership: tag going missing, etc. Such as...

for m in $(grep -rh -e '^:\S*:$' ~/Documents/wiki/*.mkd ~/Documents/wiki/diary/*.mkd); do
  for t in $(echo $m | grep -e ':[Aa-Zz]*:'); do
    printf "$t\n";
  done
done | sort -u

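...which doesn't separate the tags at all; it just prints things like: :truama:leadership:business:toxicity

Should I take some other approach? Use a different utility (maybe cut inside a loop)? Maybe do this in Python (I have a few Python scripts but don't know the language well, though perhaps this would be easy)? Every time I see awk I think "EEK", so I'd prefer a non-awk solution. I'd rather stick with the paradigms I've already used so that I learn them better.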

Answer 1

Using PCRE in grep (where available) with a positive lookbehind:

$ echo :tagOne:tagTwo:tagThree: |  grep -Po "(?<=:)[^:]+:"
tagOne:
tagTwo:
tagThree:

You lose the leading : but still get the tags.

Edit: did someone mention awk?
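
If the leading colon matters, or the local grep has no -P, a variant along the following lines might help. It is only a sketch: the function name list_tags is made up for illustration, the input is assumed to be the tag lines already selected by the first grep, and it assumes a grep that supports -o (GNU and BSD grep do, although POSIX does not require it).

```shell
# list_tags: read tag lines such as ":tagOne:tagTwo:" on stdin and
# print each tag once, wrapped in colons, in sorted order.
# (list_tags is a hypothetical helper name, not part of the original answer.)
list_tags() {
    grep -o '[^:][^:]*' |   # -o prints every run of non-colon characters on its own line
    sed 's/.*/:&:/' |       # put the surrounding colons back
    sort -u                 # sort and drop duplicates
}

printf '%s\n' ':positivity:somewhat:' ':positivity:' | list_tags
```

With that sample input this should print :positivity: and :somewhat:, one per line. In the original pipeline, list_tags would take the place of the sed and sort stages.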

$ awk '{
    while(match($0,/:[^:]+:/)) {
        a[substr($0,RSTART,RLENGTH)]
        $0=substr($0,RSTART+1)
    }
}
END {
    for(i in a)
        print i
}' file

Answer 2

Another idea using awk...

Sample data generated by OP's initial grep:

$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

One awk idea:

awk '
    { split($0,tmp,":")                     # split input on colon;
                                            # NOTE: fields #1 and #NF are the empty string - see END block
      for ( x in tmp )                      # loop through tmp[] indices
          { arr[tmp[x]] }                   # store tmp[] values as  arr[] indices; this eliminates duplicates
    }
END { delete arr[""]                        # remove the empty string from arr[]
      for ( i in arr )                      # loop through arr[] indices
          { printf ":%s:\n", i }            # print each tag on a separate line with leading/trailing colons
    }
' tags.raw | sort                           # sort final output

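Note: I'm not familiar with awk's sorting capability (which would eliminate the external sort call), so I'm open to suggestions (or could someone copy this answer into a new answer and update it with said capability?)

The above generates: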

:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
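
On the note about sorting inside awk: GNU awk (gawk) can control the order of a for-in loop through PROCINFO["sorted_in"], which would make the external sort unnecessary. The sketch below assumes gawk may or may not be installed, so it falls back to awk piped into sort when it is not; PROCINFO["sorted_in"] is a gawk extension, and the function name sorted_tags is hypothetical.

```shell
# sorted_tags [FILE...]: print each unique tag from colon-delimited tag lines, sorted.
# (hypothetical helper; uses the gawk-only PROCINFO["sorted_in"] when gawk is available)
sorted_tags() {
    if command -v gawk >/dev/null 2>&1; then
        gawk -F: '
            { for (i = 2; i < NF; i++) arr[$i] }        # fields 1 and NF are empty; skip them
        END { PROCINFO["sorted_in"] = "@ind_str_asc"    # gawk: walk indices in ascending string order
              for (t in arr) printf ":%s:\n", t
            }' "$@"
    else
        # portable fallback: same de-duplication, sorting done externally
        awk -F: '
            { for (i = 2; i < NF; i++) arr[$i] }
        END { for (t in arr) printf ":%s:\n", t }' "$@" | LC_ALL=C sort
    fi
}
```

It could be called as sorted_tags tags.raw, or fed from a pipe with no arguments.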

Answer 3

Sample data generated by OP's initial grep:

$ cat tags.raw
:politics:violence:
:positivity:
:positivity:somewhat:
:psychology:
:socialServices:family:
:strategy:
:tech:
:therapy:babylon:
:trauma:
:triggered:
:truama:leadership:business:toxicity:
:unfurling:
:tagOne:tagTwo:etcTag:

One while/for/printf idea based on associative arrays:

unset arr
typeset -A arr                          # declare array named 'arr' as associative

while read -r line                      # for each line from tags.raw ...
do
    for word in ${line//:/ }            # replace ":" with space and process each 'word' separately
    do
        arr[${word}]=1                  # create/overwrite arr[$word] with value 1;
                                        # objective is to make sure we have a single entry in arr[] for $word;
                                        # this eliminates duplicates
    done
done < tags.raw

printf ":%s:\n" "${!arr[@]}" | sort     # pass array indices (ie, our unique list of words) to printf;
                                        # per OPs desired output we'll bracket each word with a pair of ':';
                                        # then sort

Per OP's comment/question about removing the array, here we drop the array in favor of printing from the inner loop and then piping everything to sort -u:

while read -r line                      # for each line from tags.raw ...
do
    for word in ${line//:/ }            # replace ":" with space and process each 'word' separately
    do
        printf ":%s:\n" "${word}"       # print ${word} to stdout
    done
done < tags.raw | sort -u               # pipe all output (ie, list of ${word}s) for sorting and removing dups

Both of the above generate:

:babylon:
:business:
:etcTag:
:family:
:leadership:
:politics:
:positivity:
:psychology:
:socialServices:
:somewhat:
:strategy:
:tagOne:
:tagTwo:
:tech:
:therapy:
:toxicity:
:trauma:
:triggered:
:truama:
:unfurling:
:violence:
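
The ${line//:/ } substitution used above is a bash/ksh feature and leans on word splitting. For a plain POSIX sh, one possible sketch is to split on the colon directly by setting IFS. The function name split_tags is made up for this example; its body runs in a subshell so the IFS and set -f changes stay local.

```shell
# split_tags: read colon-delimited tag lines on stdin, print unique tags sorted.
# (hypothetical helper; the body runs in a subshell so IFS and set -f do not leak)
split_tags() (
    set -f                              # no filename expansion on the unquoted $line below
    while IFS= read -r line
    do
        IFS=:                           # split only on ":"
        for word in $line               # ":a:b:" yields "", "a", "b"
        do
            if [ -n "$word" ]; then     # skip the empty leading field
                printf ':%s:\n' "$word"
            fi
        done
    done | sort -u
)
```

Usage would be split_tags < tags.raw, or at the end of the original grep pipeline.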

Answer 4

A pipe through tr can split those strings into separate lines:

grep -hx -- ':[:[:alnum:]]*:' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'

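This also removes the colons, and an empty line will appear in the output (easy to fix; note that because of the leading :, the empty line will always be the first one). Add sort -u to sort and remove duplicates, or awk '!seen[$0]++' to remove duplicates without sorting.

A sed approach: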
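
Putting those remarks together, one possible way to drop the empty first line and remove duplicates while keeping first-seen order is sketched here. The function name tags_in_order is hypothetical, and the input is assumed to be the tag lines already selected by grep.

```shell
# tags_in_order: split colon-delimited tag lines into one tag per line,
# dropping empty lines and repeats while keeping first-seen order.
# (hypothetical helper name, for illustration only)
tags_in_order() {
    tr -s ':' '\n' |            # turn every ":" into a newline, squeezing repeated newlines
    awk 'NF && !seen[$0]++'     # NF skips the empty first line; seen[] drops repeats
}

printf '%s\n' ':tech:trauma:' ':trauma:strategy:' | tags_in_order
```

For the sample input this should print tech, trauma and strategy, in that order.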

sed '/^:/!d;s///;/:$/!d;s///;y/:/\n/' ~/Documents/wiki{,/diary}/*.mkd

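This also removes the colons, but avoids adding the empty line (by removing the leading/trailing : with s before using y to transliterate the remaining : to <newline>). sed can also be combined with tr: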

sed '/:$/!d;/^:/!d;s///' ~/Documents/wiki{,/diary}/*.mkd | tr -s ':' '\n'
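
For readers less used to sed's empty-regex shorthand (s/// reuses the last regular expression), the first sed one-liner can be spelled out with one command per line. This is a sketch of what should be an equivalent form, not the answerer's own wording; the wrapper name tags_from_lines is hypothetical.

```shell
# tags_from_lines [FILE...]: same steps as the sed one-liner, one command per line.
# (hypothetical wrapper; reads stdin when no files are given)
tags_from_lines() {
    sed '
        # delete lines that do not start with ":"
        /^:/!d
        # strip that leading ":"
        s/^://
        # delete lines that do not end with ":"
        /:$/!d
        # strip that trailing ":"
        s/:$//
        # turn each remaining ":" into a newline
        y/:/\n/
    ' "$@"
}
```

Appending | sort -u gives the unique, sorted list.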

Using awk to process the :-separated fields, removing duplicates:

awk -F: '/^:/ && /:$/ {for (i=2; i<NF; ++i) if (!seen[$i]++) print $i}' \
~/Documents/wiki{,/diary}/*.mkd

4 answers

Answer 1
5 2020-11-28 18:16:07

Answer 2
3 2020-11-28 18:55:07

Answer 3
2 2020-11-28 18:40:02

Answer 4
2 accepted 2020-11-29 03:02:13
