How to use sed, awk, or gawk to print only what is matched?
I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do is, from within my bash script, have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
To match at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
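A minimal sketch of how this could be wired into the bash assignment from the question. The inline input variable is only a stand-in for example.txt so the snippet runs on its own; it is an assumption of this example, not part of the answer above:

```shell
# Sample input from the question, kept inline so the sketch is self-contained.
input='a
b
c
abc12345xyz
a
b
c'

# -n suppresses the default output; the trailing p prints only lines
# on which the substitution actually happened.
myvalue=$(printf '%s\n' "$input" | sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p')
echo "$myvalue"   # prints 12345
```

Since this uses only basic regular expressions, it should behave the same on GNU and BSD/macOS sed.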
You can use sed to do this:
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppresses automatic printing of each line
-r enables extended regular expressions, so you don't have to escape the capture group parens ()
\1 is the capture group match
/g is a global match
/p prints the result
I wrote a tool for myself that makes this easier:
rip 'abc(\d+)xyz' '$1'
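As far as I know, -r is the GNU spelling of this option, while -E is accepted by both GNU sed and BSD/macOS sed. A hedged sketch of the same command with -E, using a one-line inline input (the line and value variables are just for illustration):

```shell
line='abc12345xyz'

# -E turns on extended regular expressions, so ( ) and + need no backslashes.
# The g flag is left out: the pattern spans the whole line, so it can only
# match once per line anyway.
value=$(printf '%s\n' "$line" | sed -En 's/.*abc([0-9]+)xyz.*/\1/p')
echo "$value"   # prints 12345
```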
I use perl to make this easier for myself, e.g.:
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl. The -n option instructs Perl to read one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read and, if it matches, prints the contents of the first set of brackets ($1).
You can also do this with multiple file names at the end, e.g.:
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it, you could use the -o option to print only the portion of any line that matches your regexp.
If not, then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
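To see what that three-step sed does in practice, here is a small sketch. The function name strip_to_digits and the sample lines are made up for this illustration:

```shell
# strip_to_digits is a hypothetical wrapper around the sed command above.
strip_to_digits() {
  sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
}

# A line without digits is dropped; a line with one number is reduced to it.
one=$(printf 'no digits here\nfoo 42 bar\n' | strip_to_digits)
echo "$one"   # prints 42

# Caveat: with two numbers on a line, everything between them survives.
two=$(printf 'a1b2c\n' | strip_to_digits)
echo "$two"   # prints 1b2
```

So this approach fits the "one number per line" guess, but not lines that contain several numbers.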
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
... or ...
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
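A short sketch of the greedy problem and of the negated-character-class workaround mentioned above. The variable names are only for illustration:

```shell
line='abc12345xyz'

# Greedy: the first .* swallows the whole line, the group matches the
# empty string, and the line is replaced by nothing.
greedy=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9]*\).*/\1/')
echo "[$greedy]"   # prints []

# Workaround: [^0-9]* cannot run past the first digit, so the group
# captures the whole number.
anchored=$(printf '%s\n' "$line" | sed -n 's/^[^0-9]*\([0-9][0-9]*\).*/\1/p')
echo "$anchored"   # prints 12345
```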
You can use awk with match() to access the captured group (the three-argument form of match() is a gawk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of the string), it triggers the print action.
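If gawk is not available, a similar result can be sketched with the two-argument match() that POSIX awk provides, using the RSTART and RLENGTH variables it sets. This is an alternative technique, not the gawk approach above, and the offsets 3 and 6 are assumptions tied to the fixed three-character abc and xyz delimiters:

```shell
value=$(printf 'a\nabc12345xyz\nb\n' |
  awk 'match($0, /abc[0-9]+xyz/) {
         # RSTART is where the match begins, RLENGTH is its length.
         # Skip the 3-char "abc" prefix and drop the 3-char "xyz" suffix.
         print substr($0, RSTART + 3, RLENGTH - 6)
       }')
echo "$value"   # prints 12345
```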
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl has the cleanest syntax, but if you don't have perl (it's not always there, I understand), then one way to use gawk and components of a regex is the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
The output for the sample input file will be
12345
Note: gensub replaces the entire regex match (between the //), so you need to put .* before and after abc([0-9]+)xyz to get rid of the text before and after the number in the substitution. The abc and xyz delimiters matter: with a bare .*([0-9]+).* the greedy leading .* would swallow all but the last digit.
If you want to select lines and then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
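Recent GNU grep releases print a warning that egrep is obsolescent, and grep -E is the equivalent spelling. A hedged sketch of the same pipeline with grep -E, capturing the result in a variable (the inline input and the value name are assumptions):

```shell
value=$(printf 'a\nb\nabc12345xyz\nc\n' |
  grep -E 'abc[0-9]+xyz' |                # keep only the lines of interest
  sed -e 's/^.*abc//' -e 's/xyz.*$//')    # strip the text around the number
echo "$value"   # prints 12345
```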
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so:
grep --only-matching --extended-regexp
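Put together, the two-pass version with long options might look like the sketch below. The long option names are the ones GNU grep documents; whether other grep implementations accept them is something to check locally. The values variable is just for illustration:

```shell
values=$(printf 'abc12345xyz\nabc23451xyz asdf abc34512xyz\n' |
  grep --only-matching --extended-regexp 'abc[0-9]+xyz' |   # pass 1: whole matches
  grep --only-matching --extended-regexp '[0-9]+')          # pass 2: digits only
echo "$values"
# prints:
# 12345
# 23451
# 34512
```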
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm that the lengths of $1 and $3 are both zero.
** Edited the answer after realizing that a zero-length $2 would trip up my previous solution. **
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
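For reference, the manual while()/match()/substr() loop described above can be sketched roughly as follows. This is not the actual FindAllMatches code, only a minimal illustration of the same idea for the abc...xyz pattern, with the offsets 3 and 6 assumed from the fixed three-character delimiters:

```shell
all=$(printf 'abc12345xyz\nabc23451xyz asdf abc34512xyz\nnothing here\n' |
  awk '{
    rest = $0
    # keep matching until no occurrence is left on the line
    while (match(rest, /abc[0-9]+xyz/)) {
      # drop the 3-char "abc" prefix and the 3-char "xyz" suffix
      print substr(rest, RSTART + 3, RLENGTH - 6)
      # continue scanning after the end of this match
      rest = substr(rest, RSTART + RLENGTH)
    }
  }')
echo "$all"
# prints:
# 12345
# 23451
# 34512
```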
If you're looking for ideas on how to obtain just the matched pieces, but with a complex regex that matches multiple times on each line, or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
If you also run another OFS = ""; $1 = $1;, then instead of needing the 4-argument split() or patsplit(), both of which are gawk-specific, to see what the regex seps were, the fields of the entire $0 are now in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read in the line. A straight-up print will be byte-for-byte identical to printing immediately upon reading.
I once tested it to the extreme using a regex that represents valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK Unicode all over, all read in at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF-8 Unicode.
You can do it with the shell:
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
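A slightly more defensive sketch of the same pure-shell idea, reading from stdin and printing only the number. The function name extract_num is hypothetical, and the extra digits-only check is an addition of this example, not part of the answer above:

```shell
# extract_num is a hypothetical helper name; it reads lines on stdin.
extract_num() {
  while IFS= read -r line; do
    case "$line" in
      *abc[0-9]*xyz*)
        t="${line#*abc}"    # drop everything up to the first "abc"
        t="${t%%xyz*}"      # drop the first "xyz" and everything after it
        case "$t" in
          ''|*[!0-9]*) ;;                 # empty or not purely digits: skip
          *) printf '%s\n' "$t" ;;
        esac
        ;;
    esac
  done
}

result=$(printf 'a\nabc12345xyz\nabcdefxyz\n' | extract_num)
echo "$result"   # prints 12345
```

Only POSIX parameter expansion and case patterns are used, so this should not depend on bash-specific features.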
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file