Bash Regex Capture Groups
I have a single string in this kind of format:
"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"
If I were writing a normal regex in JS, C#, etc., I'd do this:
(?:"(.+?)"|'(.+?)'|(\S+))
And iterate over the match groups to grab each string, ideally without the quotes. I ultimately want to add each value to an array, so in the example I'd end up with 3 items in an array as follows:
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
I can't figure out how to replicate this functionality with grep, sed, or bash regexes. I've tried some things like:
echo "$email" | grep -oP "\"\K(.+?)(?=\")|'\K(.+?)(?=')|(\S+)"
The problem with this is that while it kind of mimics the functionality of capture groups, it doesn't really work with multiple matches, so I get captures like:
"Mike
H<michael.haken@email1.com>"
michael.haken@email2.com
If I remove the lookahead/lookbehind logic, I at least get the 3 strings, but the first and last are still wrapped in quotes. In that approach, I pipe the output to read so I can individually add each string to the array, but I'm open to other options.
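One thing to watch for when piping into read to build an array: by default, bash runs each part of a pipeline in a subshell, so elements appended inside the loop are not visible afterwards. The sketch below contrasts that with process substitution; the three words are placeholders standing in for the grep output, and the array names are made up for the illustration.

```bash
#!/usr/bin/env bash
# Illustration only: 'one two three' stands in for the real grep output.

piped=()
printf '%s\n' one two three | while IFS= read -r line; do
  piped+=("$line")          # appended inside the pipeline's subshell
done
echo "piped: ${#piped[@]}"  # with default shell options the parent array stays empty

collected=()
while IFS= read -r line; do
  collected+=("$line")      # appended in the current shell
done < <(printf '%s\n' one two three)
echo "collected: ${#collected[@]}"
```

Feeding the loop with `< <(...)` keeps the loop in the current shell, so the array survives.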
EDIT:
I think my input example may have been confusing; it's just one possible input. The real input could be double-quoted, single-quoted, or unquoted (without spaces) strings, in any order and in any quantity. The JavaScript/C# regex I provided is the real behavior I'm trying to achieve.
You can use Perl:
$ email='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
$ echo "$email" | perl -lane 'while (/"([^"]+)"|(\S+)/g) {print $1 ? $1 : $2}'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Or in pure Bash, it gets kinda wordy:
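# anchor at the start so each match is exactly the next token
re='^[[:space:]]*("([^"]+)"|([^[:space:]]+))[[:space:]]*'
while [[ $email =~ $re ]]; do
  # group 2 is the quoted text, group 3 the bare word; only one is set
  echo "${BASH_REMATCH[2]}${BASH_REMATCH[3]}"
  # drop the matched prefix and continue with the rest
  email=${email:${#BASH_REMATCH[0]}}
done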
# same output
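The same consume-from-the-front idea can be sketched as a function that also accepts single-quoted tokens and collects the results into an array, which is what the question ultimately asks for. The names split_tokens and TOKENS are made up for this illustration, and each pattern is anchored with ^ so a match always starts at the front of the remaining string.

```bash
#!/usr/bin/env bash
# split_tokens (hypothetical helper): fills the global array TOKENS with
# every double-quoted, single-quoted, or bare token in $1, quotes removed.
split_tokens() {
  local rest=$1
  local dq='^[[:space:]]*"([^"]+)"'          # "double quoted"
  local sq="^[[:space:]]*'([^']+)'"          # 'single quoted'
  local bare='^[[:space:]]*([^[:space:]]+)'  # unquoted, no spaces
  TOKENS=()
  # Try the quoted forms first; fall back to a bare word.
  while [[ $rest =~ $dq || $rest =~ $sq || $rest =~ $bare ]]; do
    TOKENS+=("${BASH_REMATCH[1]}")
    rest=${rest:${#BASH_REMATCH[0]}}         # drop the consumed prefix
  done
}

split_tokens "\"Mike H<michael.haken@email1.com>\" michael.haken@email2.com 'Mike H<hakenmt@email1.com>'"
printf '%s\n' "${TOKENS[@]}"
# Mike H<michael.haken@email1.com>
# michael.haken@email2.com
# Mike H<hakenmt@email1.com>
```

The loop stops when nothing but whitespace is left, because every pattern requires at least one token character.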
Your first expression is fine; just be careful with the quoting (use single quotes around the pattern when backslashes are present). In the end, trim the " with sed.
$ echo "$email" | grep -Po '".*?"|\S+' | sed -r 's/"$|^"//g'
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
gawk + bash solution (adding each item to an array):
email_str='"Mike H<michael.haken@email1.com>" michael.haken@email2.com "Mike H<hakenmt@email1.com>"'
readarray -t email_arr < <(gawk -v FPAT="[^\"'[:space:]]+[^\"']+[^\"'[:space:]]+" \
'{ for(i=1;i<=NF;i++) print $i }' <<<"$email_str")
Now, all items are in email_arr.
Accessing the 2nd item:
echo "${email_arr[1]}"
michael.haken@email2.com
Accessing the 3rd item:
echo "${email_arr[2]}"
Mike H<hakenmt@email1.com>
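Bash arrays are zero-indexed, so the Nth item lives at index N-1. As a small sketch, the loop below prints every index next to its value; the array is filled by hand here so the snippet runs without gawk.

```bash
#!/usr/bin/env bash
# Filled by hand for illustration; in the answer above readarray does this.
email_arr=(
  'Mike H<michael.haken@email1.com>'
  'michael.haken@email2.com'
  'Mike H<hakenmt@email1.com>'
)

# "${!email_arr[@]}" expands to the list of indices: 0 1 2
for i in "${!email_arr[@]}"; do
  printf '%d: %s\n' "$i" "${email_arr[i]}"
done
# 0: Mike H<michael.haken@email1.com>
# 1: michael.haken@email2.com
# 2: Mike H<hakenmt@email1.com>
```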
You may use sed to achieve that, as long as the input has exactly this shape (quoted, unquoted, quoted):
$ sed -r 's/"(.*)" (.*)"(.*)"/\1\n\2\n\3/g' <<< "$EMAIL"
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Using gawk, where RS can be a regular expression:
awk -v RS='"|" ' 'NF' inputfile
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Modify your regex like this:
grep -oP '("?\s*)\K.*?(?=")' file
Output:
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Using GNU awk and FPAT to define fields by content:
$ awk '
BEGIN { FPAT="([^ ]*)|(\"[^\"]*\")" }  # define a field to be space-separated or in quotes
{
    for(i=1;i<=NF;i++) {               # iterate every field
        gsub(/^\"|\"$/,"",$i)          # remove leading and trailing quotes
        print $i                       # output
    }
}' file
Mike H<michael.haken@email1.com>
michael.haken@email2.com
Mike H<hakenmt@email1.com>
Here is what I was able to get working, although it isn't as concise as I wanted the code to be:
arr=()
while IFS= read -r line; do
  line="${line//\"/}"
  arr+=("${line//\'/}")
done < <(echo "$email" | grep -oP "\"(.+?)\"|'(.+?)'|(\S+)")
This gave me an array of the captured strings and handled the input in any order, whether wrapped in double quotes, single quotes, or no quotes at all (if the string had no spaces). It also provided the elements in the array without the wrapping quotes. I appreciate all of the suggestions.
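Note that ${line//\"/} deletes every double quote in the string, including any that appear inside a token. If only the wrapping pair should go, a small helper along these lines might be used instead; strip_wrapping_quotes is a made-up name for this sketch, not part of the solution above.

```bash
#!/usr/bin/env bash
# strip_wrapping_quotes (hypothetical helper): prints $1 with one matching
# pair of outer double or single quotes removed; anything else is unchanged.
strip_wrapping_quotes() {
  local s=$1
  case $s in
    \"*\" | \'*\') s=${s:1:${#s}-2} ;;  # drop first and last character
  esac
  printf '%s\n' "$s"
}

strip_wrapping_quotes '"Mike H<michael.haken@email1.com>"'  # Mike H<michael.haken@email1.com>
strip_wrapping_quotes "'it\"s'"                             # it"s
strip_wrapping_quotes 'michael.haken@email2.com'            # michael.haken@email2.com
```

Inside the loop, this would replace the two parameter expansions with something like arr+=("$(strip_wrapping_quotes "$line")").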