How to use sed, awk, or gawk to print only what is matched?

Question

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.

But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:

Example regular expression:

.*abc([0-9]+)xyz.*

Example input file:

a
b
c
abc12345xyz
a
b
c

As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do is, from within my bash script, have:

myvalue=$( sed <...something...> input.txt )

Things I've tried include:

sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing

Answer 1

My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:

sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt

To match at least one numeric character without +, I would use:

sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
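To tie this back to the question's myvalue=$( ... ) form, here is a minimal sketch that feeds the sample input through the second command and captures the result. The printf pipeline stands in for example.txt and is only an assumption made to keep the example self-contained.

```shell
# Sample input from the question, supplied inline instead of example.txt (assumption).
myvalue=$(printf 'a\nb\nc\nabc12345xyz\na\nb\nc\n' |
    sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p')

# -n suppresses the default output; the trailing p prints only lines where
# the substitution succeeded, so non-matching lines contribute nothing.
echo "$myvalue"
```

Without -n every input line would be echoed; without p nothing would be printed at all, which matches the two failed attempts in the question.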

Answer 2

You can use sed to do this:

 sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'

-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ()
\1 the capture group match
/g global match
/p print the result
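
-r is the GNU sed spelling. -E selects the same extended syntax and is also accepted by BSD/macOS sed and by recent GNU sed, so it may be the more portable choice. A sketch, with the input supplied inline as an assumption; the g flag is dropped here because the pattern already consumes the whole line, so at most one substitution per line can happen:

```shell
# -E: extended regular expressions, so ( ) and + need no backslashes.
value=$(printf 'a\nabc12345xyz\nb\n' |
    sed -nE 's/.*abc([0-9]+)xyz.*/\1/p')
echo "$value"
```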

I wrote a tool for myself that makes this easier:

rip 'abc(\d+)xyz' '$1'

Answer 3

I use perl to make this easier for myself, e.g.:

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'

This runs Perl. The -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.

The instruction runs a regexp on the line read and, if it matches, prints out the contents of the first set of brackets ($1).

You can also do this with multiple file names on the end, e.g.:

perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt

Answer 4

If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.

If not then here's the best sed I could come up with:

sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)

The problem with something like:

sed -e 's/.*\([0-9]*\).*/&/'

.... or

sed -e 's/.*\([0-9]*\).*/\1/'

... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
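
The greedy behaviour is easy to see on the sample line. In this sketch (the variable names are illustrative), the first command lets .* swallow everything, the second forces at least one digit and gets only the last one, and the third uses a negated character class so the digits cannot be consumed early:

```shell
line='abc12345xyz'

# .* takes the whole line and [0-9]* matches the empty string: result is empty.
greedy_empty=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9]*\).*/\1/')

# Requiring one digit makes .* back off only far enough to leave the last digit.
greedy_last=$(printf '%s\n' "$line" | sed -e 's/.*\([0-9][0-9]*\).*/\1/')

# [^0-9]* cannot run past the first digit, so the group gets all of them.
negated=$(printf '%s\n' "$line" | sed -e 's/[^0-9]*\([0-9][0-9]*\).*/\1/')

printf 'greedy_empty=[%s] greedy_last=[%s] negated=[%s]\n' \
    "$greedy_empty" "$greedy_last" "$negated"
```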

Answer 5

You can use awk with match() to access the captured group (the third, array argument to match() is a gawk extension):

$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345

This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of the string), it triggers the print action.
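
Where gawk is not available, the two-argument match() defined by POSIX awk sets RSTART and RLENGTH, and substr() can trim the fixed abc and xyz delimiters. A sketch under that assumption (the offsets 3 and 6 are simply the lengths of "abc" and "abcxyz"; the inline input is illustrative):

```shell
digits=$(printf 'a\nabc12345xyz\nb\n' | awk '
    match($0, /abc[0-9]+xyz/) {
        # Skip the 3-character prefix and drop prefix + suffix (6 characters).
        print substr($0, RSTART + 3, RLENGTH - 6)
    }')
echo "$digits"
```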

With grep you can use a look-behind and look-ahead:

$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345

$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345

This checks for the pattern [0-9]+ when it occurs between abc and xyz and prints just the digits.
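
-P (Perl-compatible regexes) is a GNU grep option and is missing from some other grep implementations, so a script may want to probe for it first. A hedged sketch: the probe and the sed fallback are illustrative additions, not part of the commands above.

```shell
line='abc12345xyz'

if printf 'abc1xyz\n' | grep -qP 'abc\K[0-9]+' 2>/dev/null; then
    # \K discards "abc" from the reported match; (?=xyz) asserts the suffix.
    digits=$(printf '%s\n' "$line" | grep -oP 'abc\K[0-9]+(?=xyz)')
else
    # Fallback for greps without -P: plain POSIX sed.
    digits=$(printf '%s\n' "$line" | sed -n 's/.*abc\([0-9][0-9]*\)xyz.*/\1/p')
fi
echo "$digits"
```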

Answer 6

perl has the cleanest syntax, but if you don't have perl (it's not always there, I understand), then one way to use gawk with components of a regex is the gensub function.

gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file

The output for the sample input file will be:

12345

Note: gensub replaces everything the regex matches (the part between the //), so you need to put .* before and after abc([0-9]+)xyz to get rid of the text before and after the number in the substitution. Anchoring the group with abc and xyz matters: with a bare .*([0-9]+).* the greedy leading .* would swallow all but the last digit.

Answer 7

If you want to select lines and then strip out the bits you don't want:

egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'

It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.

You can see this in action here:

pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
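
The same select-then-strip idea can be done in a single sed process by using an address to pick the lines and a command group to trim them. This is a sketch of an alternative, not the command from the answer, and the inline input is an assumption:

```shell
result=$(printf 'a\nb\nc\nabc12345xyz\na\nb\nc\n' |
    sed -n '/abc[0-9][0-9]*xyz/{
        s/^.*abc//
        s/xyz.*$//
        p
    }')
echo "$result"
```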

Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:

egrep '^[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'

Answer 8

The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.

Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.

$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT

$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz

$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512

Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp

Answer 9

Why even need a match group?

gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'

Let FS collect away both ends of the line.

If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.

If you're extra cautious, confirm that the lengths of $1 and $3 are both zero.

** Edited the answer after realizing that a zero-length $2 would trip up my previous solution **
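
The extra-cautious variant can be written out as follows. A sketch only: the NF test and the inline input (including a line with no digits between abc and xyz) are assumptions added for illustration.

```shell
found=$(printf 'a\nabcxyz\nabc12345xyz\n' | awk '
    BEGIN { FS = "(^.*abc|xyz.*$)" }
    # Expect exactly three fields: empty $1, digits in $2, empty $3.
    NF == 3 && length($1) == 0 && length($3) == 0 && $2 ~ /^[0-9]+$/ { print $2 }')
echo "$found"
```

The abcxyz line splits into three empty fields, so the digit test on $2 rejects it.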

Answer 10

There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
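
For reference, the kind of manual loop described above might look like the following in POSIX awk. This is an illustrative sketch for the abc...xyz pattern from the question, not the actual FindAllMatches code:

```shell
all=$(printf 'abc23451xyz asdf abc34512xyz\nnothing here\n' | awk '{
    rest = $0
    # Keep matching in whatever remains to the right of the previous match.
    while (match(rest, /abc[0-9]+xyz/)) {
        print substr(rest, RSTART + 3, RLENGTH - 6)
        rest = substr(rest, RSTART + RLENGTH)
    }
}')
echo "$all"
```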

If you're looking for ideas on how to obtain just the matched pieces with a complex regex that matches multiple times on each line, or not at all, try this:

mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) { 

    alnumstr = sprintf("%s%c", alnumstr , x) 
 }; 
 gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr) 
                       
                    # resulting str should be 54 chars long:
                    # all digits, non-vowel letters, equal sign =, and underscore _

 x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(length(alnumstr)*rand()), 1)

 } while ( --x );   # you can pick any level of precision you need.
                    # 10 chars randomly among the set is approx. 57 bits
                    #
                    # i prefer this set over all ASCII being these 
                    # just about never require escaping 
                    # feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
                    #
                    # now you've made a random nonce that can be 
                    # inserted right in the middle of just about ANYTHING
                    # -- ASCII, Unicode, binary data -- (1) which will always fully
                    # print out, (2) has extremely low chance of actually
                    # appearing inside any real word data, and (3) even lower chance
                    # it accidentally alters the meaning of the underlying data.
                    # (so intentionally leaving them in there and 
                    # passing it along unix pipes remains quite harmless)
                    #
                    # this is essentially the lazy man's approach to making nonces
                    # that kinda-sorta have some resemblance to base64
                    # encoded, without having to write such a module (unless u have
                    # one for awk handy)


    regex1 = (..);  # build whatever regex you want here

    FS = OFS = nonceFS;

 } $0 ~ regex1 { 

    gsub(regex1, nonceFS "&" nonceFS); $0 = $0;  

                   # now you've essentially replicated what gawk patsplit( ) does,
                   # or gawk's split(..., seps) tracking 2 arrays one for the data
                   # in between, and one for the seps.
                   #
                   # via this method, that can all be done upon the entire $0,
                   # without any of the hassle (and slow downs) of 
                   # reading from associatively-hashed arrays,
                   # 
                   # simply print out all your even numbered columns
                   # those will be the parts of "just the match"

If you also run another OFS = ""; $1 = $1; then, instead of needing 4-argument split() or patsplit(), both of which are gawk-specific, to see what the regex seps were, the entire $0's fields are now in a data1-sep1-data2-sep2-... pattern, all while $0 looks EXACTLY the same as when you first read in the line. A straight-up print will be byte-for-byte identical to printing immediately upon reading.

I once tested this to the extreme using a regex that represents valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK Unicode all over, all read in at once into $0, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being one single character of either ASCII or multi-byte UTF-8 Unicode.

Answer 11

You can do it with the shell:

while read -r line
do
    case "$line" in
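        *abc*[0-9]*xyz* )
            t="${line##*abc}"              # drop everything up to and including the last "abc"
            echo "num is ${t%%xyz*}";;     # drop the first "xyz" and everything after it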
    esac
done <"file"

Answer 12

For awk, I would use the following script (it prints each whole matching line, not just the digits):

/.*abc([0-9]+)xyz.*/ {
            print $0;
            next;
            }
            {
            # default, do nothing
            }

Answer 13

gawk '/.*abc([0-9]+)xyz.*/' file

13 answers

Answer 1: 44, accepted, 2009-11-14 08:50:20
Answer 2: 38, 2016-02-03 19:39:12
Answer 3: 18, 2009-11-14 08:44:04
Answer 4: 5, 2009-11-14 10:56:46
Answer 5: 4, 2016-08-22 09:01:11
Answer 6: 2, 2013-04-29 20:21:53
Answer 7: 1, 2009-11-14 08:46:20
Answer 8: 1, 2019-10-09 16:11:01
Answer 9: 0, 2021-02-04 00:16:32
Answer 10: 0, 2021-05-05 16:38:56
Answer 11: -1, 2009-11-28 01:58:22
Answer 12: -3, 2009-11-14 08:54:58
Answer 13: -3, 2009-11-14 09:18:02
