Bash Regex Capture Groups

I have a single string in this format:

"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"

If I were writing a normal regex in JS, C#, etc., I'd do this:

(?:"(.+?)"|'(.+?)'|(\S+))

And iterate over the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example I'd end up with 3 items in the array, as follows:

Mike H<michael.haken@email1.com>
michael.haken@email2.com 
Mike H<hakenmt@email1.com>

I can't figure out how to replicate this functionality with grep, sed, or Bash regexes. I've tried things like:

echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"

The problem is that, while this roughly mimics capture groups, it doesn't work with multiple matches, so I get captures like:

"Mike
H<michael.haken@email1.com>"
 michael.haken@email2.com 

If I remove the lookahead/lookbehind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can add each string to the array individually, but I'm open to other options.

EDIT:

I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or unquoted (without spaces) strings, in any order and any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.

You can use Perl:

$ email='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}' 
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Or in pure Bash, it gets kinda wordy:

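re='^[[:space:]]*("([^"]+)"|([^[:space:]]+))'
while [[ $email =~ $re ]]; do
    # group 2 holds a double-quoted token, group 3 an unquoted one
    echo "${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
    # drop the matched prefix before looking for the next token
    email=${email:${#BASH_REMATCH[0]}}
done
# same output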
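
The loop above prints each token and only recognizes double quotes. As a rough sketch of how the same idea might be extended to single quotes and to filling an array, the following wraps it in a function. The names split_quoted and tokens, and the 'Jane D' sample value, are made up for illustration and are not part of the original answer. It assumes Bash 3.2 or later and, like the loop above, relies on the POSIX leftmost-longest rule to prefer the quoted alternative.

```shell
#!/usr/bin/env bash
# Sketch only: split_quoted and tokens are illustrative names.
split_quoted() {
    local input=$1
    # Optional leading whitespace, then "..." | '...' | a run of non-whitespace.
    local re="^[[:space:]]*(\"([^\"]+)\"|'([^']+)'|([^[:space:]]+))"
    tokens=()
    while [[ $input =~ $re ]]; do
        # At most one of groups 2, 3, 4 is non-empty for any given match.
        tokens+=("${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}")
        # Drop everything consumed by this match before trying again.
        input=${input:${#BASH_REMATCH[0]}}
    done
}

# 'Jane D' is an invented sample value to exercise the single-quote branch.
split_quoted "\"Mike H<michael.haken@email1.com>\" 'Jane D' michael.haken@email2.com"
printf '%s\n' "${tokens[@]}"
```

Because the pattern is anchored at the start and consumes leading whitespace itself, cutting off ${#BASH_REMATCH[0]} characters always removes exactly the matched prefix.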

Your first expression is fine; just be careful with the quoting (use single quotes when backslashes are present). Finally, trim the surrounding " with sed:

$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

gawk + bash solution (adding each item to an array):

email_str='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'

readarray -t email_arr < <(awk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
                         '{ for(i=1;i<=NF;i++) print $i }' <<<"$email_str")

Now, all items are in email_arr.

Accessing the 2nd item:

echo "${email_arr[1]}"
michael.haken@email2.com

Accessing the 3rd item (Bash arrays are zero-based, so it is at index 2):

echo "${email_arr[2]}"
Mike H<hakenmt@email1.com>
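
Bash arrays are zero-based, so three items occupy indices 0 through 2. A small sketch of walking the array by index, with email_arr filled in by hand here rather than by the gawk command so that the snippet stands on its own:

```shell
#!/usr/bin/env bash
# email_arr is populated literally for illustration.
email_arr=(
    'Mike H<michael.haken@email1.com>'
    'michael.haken@email2.com'
    'Mike H<hakenmt@email1.com>'
)

# "${!email_arr[@]}" expands to the indices that are actually set.
for i in "${!email_arr[@]}"; do
    printf '%d: %s\n' "$i" "${email_arr[i]}"
done

# The last element sits at index (count - 1).
last=${email_arr[${#email_arr[@]}-1]}
echo "count: ${#email_arr[@]}, last: $last"
```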

You can use sed to achieve that, as long as the input always has exactly this layout (quoted, unquoted, quoted):

$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$email"
Mike H<michael.haken@email1.com>
michael.haken@email2.com 
Mike H<hakenmt@email1.com>

Using gawk, where RS can be a regular expression:

awk -v RS='"|" ' 'NF' inputfile
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Modify your regex like this:

grep -oP '("?\s*)\K.*?(?=")' file

Output:

Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Using GNU awk and FPAT to define fields by content:

$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }  # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {               # iterate every field
        gsub(/^\"|\"$/,"",$i)          # remove leading and trailing quotes
        print $i                       # output
    }
}' file
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>

Here is what I ended up with. It works, but it isn't as concise as I wanted the code to be:

arr=()
while IFS= read -r line; do
  line="${line//\"/}"
  arr+=("${line//\'/}")
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")

This gave me an array of the captured values and handled input in any order, whether wrapped in double quotes, single quotes, or no quotes at all (when there are no spaces). It also stored the elements without the wrapping quotes. I appreciate all of the suggestions.
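
One caveat: the loop above deletes every quote character, including any that appear inside a token (an apostrophe in an unquoted value, for example). A possible refinement, sketched here with a hypothetical helper named strip_wrapping_quotes, removes only a single matching pair of quotes at the two ends:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip one pair of matching wrapping quotes, if present.
strip_wrapping_quotes() {
    local s=$1
    if [[ ${#s} -ge 2 ]]; then
        local first=${s:0:1} last=${s: -1}
        # Only strip when both ends carry the same quote character.
        if [[ $first == "$last" && ( $first == '"' || $first == "'" ) ]]; then
            s=${s:1:${#s}-2}
        fi
    fi
    printf '%s\n' "$s"
}

strip_wrapping_quotes '"Mike H<michael.haken@email1.com>"'
strip_wrapping_quotes "it's"
```

Inside the loop, this could replace the two substitutions with something like arr+=("$(strip_wrapping_quotes "$line")"), at the cost of one subshell per token.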

