I need to clean up some text and am trying to remove numbers when they appear in parentheses. If the parentheses contain anything else, that should remain; if nothing is left, the parentheses should go too.
Examples:
Foo 12 (bar, 13) -> Foo 12 (bar)
Foo 12 (13, bar, 14) -> Foo 12 (bar)
Foo (14, 13) -> Foo
I thought I would start by breaking up the string and removing numbers if they appear between parentheses but it seems that I am missing something.
echo "Foo 12 (bar, 12)" | sed 's/\(.*\)\((\)\([^0-9,].*\)\([, ].*\)\([0-9].*\)\()\)/\1\2\3\6/g'
results in Foo 12 (bar,).
I guess my approach is too atomic. What can I do?
If you have no problem with Perl, you could try this.
$ perl -pe 's/\s*,?\s*\b\d+\b\s*,?\s*(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo
OR
$ perl -pe 's/(?:(?<=\()\d+,\h*|,?\h*\d+\b)(?=[^()]*\))//g;s/\h*\(\)$//' file
Foo 12 (bar)
Foo 12 (bar)
Foo
Here's a general approach for problems like this, where you want to isolate a specific token and work on it, adapted for your problem:
#!/bin/sed -f
:loop # while the line has a matching token
/([^)]*[0-9]\+[^)]*)/ {
s//\n&\n/ # mark it -- \n is good as a marker because it is
# nowhere else in the line
h # hold the line!
s/.*\n\(.*\)\n.*/\1/ # isolate the token
s/[0-9]\+,\s*//g # work on the token. Here this removes all numbers
s/,\s*[0-9]\+//g # with or without commas in front or behind
s/\s*[0-9]\+\s*//g
s/\s*()// # and also empty parens if they exist after all that.
G # get the line back
# and replace the marked token with the result of the
# transformation
s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/
b loop # then loop to get all such tokens.
}
To those who argue that this goes beyond the scope of what should reasonably be done with sed I say: True, but...well, true. But if all you see is nails, this is a way to make sed into a sledgehammer.
This can of course be written inline (although that does not help readability):
echo 'Foo 12 (bar, 12)' | sed ':loop;/([^)]*[0-9]\+[^)]*)/{;s//\n&\n/;h;s/.*\n\(.*\)\n.*/\1/;s/[0-9]\+,\s*//g;s/,\s*[0-9]\+//g;s/\s*[0-9]\+\s*//g;s/\s*()//;G;s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/;b loop}'
but my advice is to put it into a file and run
echo 'Foo 12 (bar, 12)' | sed -f foo.sed
Or, with the shebang like above, chmod +x foo.sed and run
echo 'Foo 12 (bar, 12)' | ./foo.sed
I have not benchmarked this, by the way. I imagine that it is not the most efficient way to process large amounts of data.
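To show the mark / hold / isolate / restore skeleton on its own, here is a minimal sketch applied to a simpler, made-up task: replacing spaces with underscores, but only inside parentheses. The function name underscore_in_parens and the task itself are assumptions for illustration. Like the script above, it assumes GNU sed, because it relies on \n in the replacement text.

```shell
# Hypothetical example task; assumes GNU sed.
underscore_in_parens() {
  sed '
    :loop
    # while some parenthesized group still contains a space
    /([^)]* [^)]*)/ {
      # mark the group with newlines
      s//\n&\n/
      # save the marked line in the hold space
      h
      # keep only the marked group
      s/.*\n\(.*\)\n.*/\1/
      # the actual work on the isolated token
      s/ /_/g
      # append the saved line after the transformed token
      G
      # splice the transformed token back in place of the marked one
      s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/
      b loop
    }
  '
}

echo 'a b (c d) e (f g h)' | underscore_in_parens
# prints: a b (c_d) e (f_g_h)
```

Only the line doing the work on the token and the address regex change from task to task. The loop terminates because the transformed token no longer matches the address.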
EDIT: In response to the comments: I'm not sure what OP wants in such cases, but for the sake of completeness, the basic pattern could be adapted for the other behavior like this:
#!/bin/sed -f
:loop
/(\s*[0-9]\+\s*)\|(\s*[0-9]\+\s*,[^)]*)\|([^)]*,\s*[0-9]\+\s*)\|([^)]*,\s*[0-9]\+\s*,[^)]*)/ {
s//\n&\n/
h
s/.*\n\(.*\)\n.*/\1/
s/,\s*[0-9]\+\s*,/,/g
s/(\s*[0-9]\+\s*,\s*/(/
s/\s*,\s*[0-9]\+\s*)/)/
s/\s*(\s*[0-9]*\s*)//
G
s/\(.*\)\n\(.*\)\n.*\n\(.*\)/\2\1\3/
b loop
}
The regex at the top looks a lot scarier now. It should help to know that it consists of the four subpatterns
(\s*[0-9]\+\s*)
(\s*[0-9]\+\s*,[^)]*)
([^)]*,\s*[0-9]\+\s*)
([^)]*,\s*[0-9]\+\s*,[^)]*)
which are or-ed together with \|. This should cover all cases and not match things like foo12, 12bar, and foo12bar in parentheses (unless there's a standalone number in them as well).
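A quick way to check that claim is to try the four subpatterns against a few sample groups. The sketch below is an assumption-laden translation, not the author's code: it rewrites the pattern as an extended regex for grep -E, uses [[:space:]] in place of \s, and the names n, pattern, and has_standalone_number are made up for illustration.

```shell
# n: a number with optional surrounding whitespace (hypothetical helper variable).
n='[[:space:]]*[0-9]+[[:space:]]*'

# The four subpatterns from above, or-ed together in ERE syntax.
pattern="\\($n\\)|\\($n,[^)]*\\)|\\([^)]*,$n\\)|\\([^)]*,$n,[^)]*\\)"

# Succeeds if the argument contains a parenthesized group with a standalone number.
has_standalone_number() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

for s in '(12)' '(12, foo)' '(foo, 12)' '(foo, 12, bar)' '(foo12)' '(12bar)' '(foo12bar)'; do
  if has_standalone_number "$s"; then
    echo "$s: match"
  else
    echo "$s: no match"
  fi
done
```

The first four samples match (one per subpattern), and the last three do not.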
Here is an awk version:
awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar)
Foo 12 (bar)
Foo
cat file
Foo 12 (bar, 13, more)
Foo 12 (13, bar, 14) (434, tar ,56)
Foo (14, 13)
awk -F' *\\(|\\)' '{for (i=2;i<=NF;i+=2) {n=split($i,a," *, *");f="";for (j=1;j<=n;j++) f=f (a[j]!~/[[:digit:]]/?a[j]",":""); $i=f?"("f")":"";sub(/,)/,")",$i)}}1' file
Foo 12 (bar,more)
Foo 12 (bar) (tar)
Foo
A more readable version:
awk -F' *\\(|\\)' '
{
for (i=2;i<=NF;i+=2) {
n=split($i,a," *, *")
f=""
for (j=1;j<=n;j++)
f=f (a[j]!~/[[:digit:]]/?a[j]",":"")
$i=f?"("f")":""
sub(/,)/,")",$i)
}
}
1' file
sed ':retry
# remove "( number )"
s/( *[0-9]* *)//
# remove first ", number" (not at first place)
s/^\(\([^(]*([^(]*)\)*[^(]*([^)]*\), *[0-9]\{1,\} *\([,)]\)/\1\3/
t retry
# remove " number" (first place)
s/^\(\([^(]*([^(]*)\)*[^(]*(\) *[0-9]\{1,\}\(,\{0,1\}\)\()\{0,1\}\)]*/\1\4/
# case needed where only "( number)" or "()" are the result at this moment
t retry
' YourFile
(POSIX version; use --posix on GNU sed)