Bash regex capture groups

Question

I have a single string in this kind of format:

"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"

If I were writing a normal regex in JS, C#, etc., I'd do this:

(?:"(.+?)"|'(.+?)'|(\S+))

And iterate the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example, I'd end up with 3 items in an array as follows:

Mike H<michael.haken@email1.com>
michael.haken@email2.com 
Mike H<hakenmt@email1.com>

I can't figure out how to replicate this functionality with grep, sed, or bash regexes. I've tried things like:

echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"

The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiple matches, so I get captures like:

"Mike
H<michael.haken@email1.com>"
 michael.haken@email2.com

If I remove the lookahead/lookbehind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can individually add each string to the array, but I'm open to other options.

EDIT:

I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or non-quoted (without spaces) strings, in any order and in any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.

Answer 1

You can use Perl:

$ email='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}' 
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Or in pure Bash, it gets kinda wordy:

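re='\"([^\"]+)\"[[:space:]]*|([^[:space:]]+)[[:space:]]*'
while [[ $email =~ $re ]]; do
    # only one of the two groups is set for any given match
    echo "${BASH_REMATCH[1]}${BASH_REMATCH[2]}"
    # drop as many characters as the match consumed (this loop uses up $email)
    i=${#BASH_REMATCH}
    email=${email:i}
done
# same output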
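
The loop above only prints the values and covers double quotes. As a rough sketch of how the same idea could collect the values into an array and also accept single-quoted items (as the question's edit asks), something like the following might work. The names `input`, `rest`, and `items` are made up for illustration, and the single-quoted item in the sample is an assumption, not part of the original input.

```shell
#!/usr/bin/env bash
# Sample input based on the question; the last item is single-quoted here
# (an assumption) to exercise all three forms.
input="\"Mike H<michael.haken@email1.com>\" michael.haken@email2.com 'Mike H<hakenmt@email1.com>'"

# Anchored ERE: optional leading whitespace, then exactly one token.
# Group 2 = double-quoted body, group 3 = single-quoted body, group 4 = bare word.
re="^[[:space:]]*(\"([^\"]+)\"|'([^']+)'|([^[:space:]]+))"

items=()
rest=$input
while [[ $rest =~ $re ]]; do
    # Only one of the three inner groups is non-empty for any given match.
    items+=("${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}")
    # The match starts at offset 0 because of ^, so its length is the prefix to drop.
    rest=${rest:${#BASH_REMATCH[0]}}
done

printf '%s\n' "${items[@]}"
```

Working on a copy (`rest`) leaves the original string untouched, and anchoring with `^` keeps the "strip the matched length" step correct even when the string starts with whitespace.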

Answer 2

Your first expression is fine; just be careful with the quotes (use single quotes when backslashes are present). At the end, trim the " with sed.

$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
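
For completeness, here is a sketch of the same pipeline written without PCRE (a negated character class stands in for the lazy `.*?`), extended to single quotes, and loaded into an array with `mapfile`. It assumes bash 4 or later plus a grep that supports `-o` and a sed that supports `-E`. The array name `addresses` and the single-quoted sample item are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sample input mixing double-quoted, bare, and single-quoted items.
email="\"Mike H<michael.haken@email1.com>\" michael.haken@email2.com 'Mike H<hakenmt@email1.com>'"

# grep -oE prints one token per line. ERE alternation takes the longest match,
# so a quoted token wins over the bare-word alternative at the same position.
# The two sed expressions then strip one pair of wrapping quotes, if present.
mapfile -t addresses < <(
    printf '%s\n' "$email" |
        grep -oE "\"[^\"]*\"|'[^']*'|[^[:space:]]+" |
        sed -E -e 's/^"(.*)"$/\1/' -e "s/^'(.*)'\$/\1/"
)

printf '%s\n' "${addresses[@]}"
```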

Answer 3

gawk + bash solution (adding each item to an array):

email_str='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'

readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
                         '{ for(i=1;i<=NF;i++) print $i }' <<<"$email_str")

Now, all items are in email_arr.

Accessing the 2nd item:

echo "${email_arr[1]}"
michael.haken@email2.com

Accessing the 3rd item:

echo "${email_arr[2]}"
Mike H<hakenmt@email1.com>
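
Bash arrays are zero-based, which is why the 2nd item lives at index 1. A small sketch of counting and walking the array follows; the array is filled by hand here only to keep the snippet self-contained, standing in for the readarray call above.

```shell
#!/usr/bin/env bash
# Stand-in for the array produced by readarray (values copied from the sample).
email_arr=(
    'Mike H<michael.haken@email1.com>'
    'michael.haken@email2.com'
    'Mike H<hakenmt@email1.com>'
)

# Number of elements.
echo "count: ${#email_arr[@]}"

# "${!email_arr[@]}" expands to the indices: 0, 1, 2.
for i in "${!email_arr[@]}"; do
    printf '%d: %s\n' "$i" "${email_arr[i]}"
done
```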

Answer 4

You can use sed to achieve that, for input laid out exactly like the sample (quoted, bare, quoted):

$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$email"
Mike H<michael.haken@email1.com>
michael.haken@email2.com 
Mike H<hakenmt@email1.com>

Answer 5

Using gawk, where RS can be a multi-character regular expression:

awk -v RS='"|" ' 'NF' inputfile
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Answer 6

Modify your regex like this:

grep -oP '("?\s*)\K.*?(?=")' file

Output:

Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Answer 7

Using GNU awk and FPAT to define fields by content:

$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }  # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {               # iterate every field
        gsub(/^\"|\"$/,"",$i)          # remove leading and trailing quotes
        print $i                       # output
    }
}' file
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Answer 8

Here is what I got working, although it isn't as concise as I wanted the code to be:

arr=()
while IFS= read -r line; do
  line="${line//\"/}"
  arr+=("${line//\'/}")
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")

This gave me an array of the captured values and handled the input in any order, whether wrapped in double quotes, single quotes, or no quotes at all (when the value had no spaces). It also provided the elements in the array without the wrapping quotes. I appreciate all of the suggestions.
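
One caveat with the loop above: `${line//\"/}` and `${line//\'/}` remove every quote character, including any that sit inside a value (an apostrophe in a display name, say). If that matters, a helper along these lines could strip only a matching outer pair. `strip_wrapping_quotes` is a hypothetical name, not something from the original answer, and the sample value is invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove one pair of wrapping quotes, leave inner quotes alone.
strip_wrapping_quotes() {
    local s=$1
    # Glob match: starts and ends with the same kind of quote (needs 2+ characters).
    if [[ $s == \"*\" || $s == \'*\' ]]; then
        s=${s:1:${#s}-2}
    fi
    printf '%s\n' "$s"
}

strip_wrapping_quotes "\"O'Brien <ob@example.com>\""
strip_wrapping_quotes "michael.haken@email2.com"
```

Inside the while loop this would replace the two substitutions, for example `arr+=("$(strip_wrapping_quotes "$line")")`.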

8 answers

Answer 1
3 2017-09-25 03:25:05

Answer 2
1 2017-09-25 07:21:07

Answer 3
1 2017-09-25 07:47:38

Answer 4
0 2017-09-25 02:59:36

Answer 5
0 2017-09-25 06:46:05

Answer 6
0 2017-09-25 06:58:08

Answer 7
0 2017-09-25 11:30:33

Answer 8
0 accepted 2017-09-25 16:00:25
