
How to extract nested parentheses in sed?

I am trying to extract whitespace-separated columns with sed. Here is an example with ps:

$ ps | sed -n -E "s/^(\s*([^\s]+)){4}.*$/\0/p"
  PID TTY          TIME CMD
 8446 pts/185  00:00:00 ps
 8447 pts/185  00:00:00 sed
54326 pts/185  00:00:00 bash
$ ps | sed -n -E "s/^(\s*([^\s]+)){4}.*$/\1/p"
D
t
t
t

Why does it behave this way? How do I specify nested parentheses?


I would like to get the column of PIDs (in this example).


I found that I can't process non-nested parentheses either:

$ ps > out.txt
$ cat out.txt
  PID TTY          TIME CMD
14819 pts/185  00:00:00 ps
54326 pts/185  00:00:00 bash
$ cat out.txt | sed -n -E "s/^\s*([^\s]+)\s*([^\s]+)\s*([^\s]+)\s*([^\s]+).*$/\2/p"
C


$ 

In the last case it prints a line with C and 2 empty lines.

Why?

Suppose the raw file is

a1  a2 a3 a4
b1 b2 b3 b4
c1  c2 c3 c4
d1 d2 d3 d4

(If there is leading whitespace, remove it in a separate operation: 's/^ *//'.)

Without extended regular expressions, you can do this:

sed 's/\([^ ][^ ]* *\)\{3\}.*/\1/'

which will yield the third column (each output line also keeps the trailing space that the group matched):

a3
b3
c3
d3
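
If the trailing space and the separate leading-whitespace pass are unwanted, the same basic-regex idea can be adjusted. The following is only a minimal sketch, assuming space-separated input like the rows above; the variable names rows and third are made up for the demo. It skips two "field plus spaces" groups and captures the third field on its own in a second group.

```shell
# Sample rows from above; one row is given leading spaces on purpose.
rows=$(printf '%s\n' 'a1  a2 a3 a4' '  b1 b2 b3 b4' 'c1  c2 c3 c4' 'd1 d2 d3 d4')

# 1st command: strip leading spaces.
# 2nd command: skip two "field + spaces" groups, capture the third field
# (without trailing spaces) in group 2, and drop the rest of the line.
third=$(printf '%s\n' "$rows" |
    sed -e 's/^ *//' -e 's/\([^ ][^ ]* *\)\{2\}\([^ ][^ ]*\).*/\2/')
printf '%s\n' "$third"
```

Lines with fewer than three fields do not match the second command and pass through with only the leading spaces removed.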

Extended regular expressions (-E) might make this a little cleaner, but not all sed implementations support backreferences with them, so a portable version would need slightly more complicated logic.

First, please avoid double quotes around the sed script unless you want the shell to interpret its contents (see https://mywiki.wooledge.org/Quotes).

awk is better suited for field processing, but I'll try to provide a sed solution with explanations (assuming GNU sed, since \s is used):

$ sed -n -E 's/^(\s*([^\s]+)){4}.*$/\1/p' ip.txt
D
t
t
t
  • ^ start of line anchor
  • [^\s] this won't work as you intended: it matches any character other than \ and s. \s, \S, \w and \W are not recognized by sed inside bracket expressions. In this case you can simply use \S instead
  • (\s*([^\s]+)) you probably intended to capture only the field value by using two capture groups
  • {4} however, when a quantifier is applied to a group, only the last match is available for backreferencing; the earlier matches are overwritten (further reading: https://www.regular-expressions.info/captureall.html)
  • because \s* can also match nothing, a string like CMD gets matched as several fields (C, M and D) in the above case
  • also, not sure why you are using -n and p instead of leaving them out
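
The two main points in the list above can be checked in isolation. This is a small sketch assuming GNU sed; the input strings and the variable names class_demo and last_iter are invented for illustration.

```shell
# Inside a bracket expression, \s is two literal characters (backslash and "s"),
# so [^\s] matches everything except those two; only the "s" survives here.
class_demo=$(echo 'a s b' | sed -E 's/[^\s]/X/g')
echo "$class_demo"   # XXsXX

# A quantified group keeps only its last iteration for backreferences,
# so \2 holds the third word, not the first.
last_iter=$(echo 'one two three' | sed -E 's/^(\s*(\S+)){3}.*/\2/')
echo "$last_iter"    # three
```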

To get a specific column, I'd use:

$ sed -E 's/^\s*(\S+).*/\1/' ip.txt
PID
8446
8447
54326

$ sed -E 's/^\s*\S+\s+(\S+).*/\1/' ip.txt
TTY
pts/185
pts/185
pts/185

$ sed -E 's/^\s*\S+\s+\S+\s+(\S+).*/\1/' ip.txt
TIME
00:00:00
00:00:00
00:00:00

Which gives us the following generic formula, where the number in braces is how many fields to skip before the one you want:

$ sed -E 's/^\s*(\S+\s+){0}(\S+).*/\2/' ip.txt
PID
8446
8447
54326
$ sed -E 's/^\s*(\S+\s+){1}(\S+).*/\2/' ip.txt
TTY
pts/185
pts/185
pts/185
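
The generic formula could be wrapped in a small shell function so the column number becomes an argument. This is a sketch only, assuming GNU sed; the function name nth_field and the variable sample are hypothetical. The awk line is included for comparison, since awk splits fields natively.

```shell
# nth_field N: print the Nth whitespace-separated field of each input line.
# Double quotes are deliberate here so the shell expands the skip count;
# \s, \S and \2 pass through unchanged inside double quotes.
nth_field() {
    sed -E "s/^\s*(\S+\s+){$(( $1 - 1 ))}(\S+).*/\2/"
}

sample=$(printf '%s\n' '  PID TTY          TIME CMD' ' 8446 pts/185  00:00:00 ps')

printf '%s\n' "$sample" | nth_field 3
# awk equivalent for comparison
printf '%s\n' "$sample" | awk -v n=3 '{ print $n }'
```

One difference to keep in mind: without -n and p, a line that has fewer than N fields does not match and is printed unchanged by the sed version, whereas awk prints an empty line for it.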

This might work for you (GNU sed):

sed -nE 's/\S+/\n&\n/1;s/.*\n(.*)\n.*/\1/p' file

This surrounds the nth column (in this example column 1, selected by the number flag on the first substitution) with newlines, then uses pattern matching to remove the fields and newlines on either side.

Alternatively:

sed -nE 's/^(\s*(\S+)){4}.*/\2/p' file

This will return the 4th field.
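
To pick a different column with the newline-marker approach, only the number flag on the first substitution needs to change. Here is a minimal sketch for column 3, assuming GNU sed; the sample lines (including the short 'x y' line) and the variable names are made up for the demo.

```shell
sample=$(printf '%s\n' '  PID TTY          TIME CMD' ' 8446 pts/185  00:00:00 ps' 'x y')

# Wrap the 3rd run of non-whitespace in newlines, then keep only the text
# between the two newlines. With -n and the p flag, a line is printed only
# when the second substitution succeeds, so the two-field line 'x y' is dropped.
third_col=$(printf '%s\n' "$sample" |
    sed -nE 's/\S+/\n&\n/3;s/.*\n(.*)\n.*/\1/p')
printf '%s\n' "$third_col"
```

This also shows one reason to keep -n and p: lines that lack the requested column are suppressed instead of being echoed back unchanged.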
