sed replace whitespace with underscore between 2 strings

Question

I have a file that contains lines like this:

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>

I need to replace all the spaces between <phrase> tags with an underscore. So basically I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried.

sed 's@>\s+[</]@_@g'

perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\/]/c$&/ge'

sed 's@\(\[>^[<\/]]*\)\s+@\1_@g'

awk -v RS='\\[>^[<\]/]*\\]' '{ gsub(/\<(\s+)\>/, "_", RT); printf "%s%s", $0, RT }' infile

I've been looking at these two questions, trying to adapt their answers to the characters I need:

sed substitute whitespace for dash only between specific character patterns

https://unix.stackexchange.com/questions/63335/how-to-remove-all-white-spaces-just-between-brackets-using-unix-tools

Can anyone please help?

Answer 1

Don't use regular expressions to parse XML/HTML.

use warnings;
use 5.014;  # for /r modifier
use Mojo::DOM;

my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT

my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;

Output:

some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
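For comparison, the same parser-based idea can be sketched in Python with the standard library's `xml.etree.ElementTree`. This is only a sketch under assumptions: the helper name `underscore_phrases` and the temporary `<root>` wrapper are made up for illustration, the line must be well-formed XML once wrapped (a bare `&` or an unclosed tag raises an error, whereas Mojo::DOM is more lenient), and only the text directly inside each `<phrase>` is changed, not text inside nested child elements.

```python
import xml.etree.ElementTree as ET


def underscore_phrases(line: str) -> str:
    """Replace spaces with underscores inside <phrase> elements of one line."""
    # Wrap the fragment in a hypothetical <root> element so it parses as a document.
    root = ET.fromstring(f"<root>{line}</root>")
    for elem in root.iter("phrase"):
        if elem.text:
            # elem.text is the text directly after the opening <phrase> tag.
            elem.text = elem.text.replace(" ", "_")
    # short_empty_elements=False keeps "<root></root>" even for an empty line,
    # so the wrapper can be sliced off safely.
    xml = ET.tostring(root, encoding="unicode", short_empty_elements=False)
    return xml[len("<root>"):-len("</root>")]


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_phrases(line))
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```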

Update: Mojolicious even contains some sugar that allows smashing that code into a one-liner:

$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt

Answer 2

"I need to replace every space that falls between > and </ with an underscore."

That won't actually do what you want, because e.g. in

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

the substrings "between > and </ " cover more than you think (marked ^ above).
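The over-matching can be checked quickly. The following Python sketch (Python is used here purely for illustration, and `loose_matches` is a made-up helper name) lists what a literal "anything between > and </" pattern picks up on the sample line:

```python
import re


def loose_matches(line: str) -> list[str]:
    """Return every non-greedy match of '>' ... '</' in the line."""
    return re.findall(r">.*?</", line)


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
for match in loose_matches(line):
    print(repr(match))
# '>a phrase</'
# '> some thing else <phrase>other stuff</'
```

The second match starts at the > of the first closing tag, so it swallows the text outside the tags together with the second <phrase> element.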

I think the most straightforward way to express your requirements in Perl is

perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'

Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).
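The effect of the [^<>] class can be illustrated with a rough Python equivalent of the substitution. This is a sketch, not a drop-in replacement for the Perl command; the function name `underscore_between_tags` is an assumption made for the example.

```python
import re


def underscore_between_tags(line: str) -> str:
    """Replace spaces in every '>text</' run that contains no '<' or '>'."""
    return re.sub(
        r">[^<>]*</",
        lambda m: m.group(0).replace(" ", "_"),  # same job as tr/ /_/r on $&
        line,
    )


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_between_tags(line))
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```

Because the run between > and </ may not contain < or >, the text " some thing else " is skipped: it is followed by <phrase>, not by </. Like the Perl version, the pattern does not look at the tag name, so it applies to any element, not only <phrase>.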

If that's too readable, you can also do

perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'

Answer 3

This might work for you (GNU sed):

sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file

Delimit tags by inserting newlines. Iteratively substitute spaces between pairs of newlines with underscores. When there are no more matches, remove the introduced newlines.
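To make the three steps easier to follow, here is a Python transliteration of the same idea. It is a sketch under assumptions: the helper name `underscore_with_markers` is invented, and the input line is assumed to contain no newline of its own (which is what sed guarantees for its pattern space here).

```python
import re

# Same shape as the sed regex: text before the first marker, then any number of
# (space-free tagged segment, outside segment) pairs, then a tagged segment up
# to its first space.
SPACE_IN_TAGGED_SEGMENT = re.compile(r"^([^\n]*(?:\n[^\n ]*\n[^\n]*)*\n[^\n]*) ")


def underscore_with_markers(line: str) -> str:
    # Step 1: put a newline marker in front of every opening and closing tag.
    marked = re.sub(r"</?phrase>", lambda m: "\n" + m.group(0), line)
    # Step 2: replace one space at a time until nothing matches (sed's ":a ... ta").
    while True:
        marked, count = SPACE_IN_TAGGED_SEGMENT.subn(r"\1_", marked, count=1)
        if count == 0:
            break
    # Step 3: remove the markers again.
    return marked.replace("\n", "")


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_with_markers(line))
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```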

Answer 4

Another Perl solution, replacing only between the <phrase> tags:

$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

EDIT

Thanks @haukex, this can be shortened further:

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Answer 5

With GNU awk for multi-char RS and RT:

$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
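The record-splitting idea does not depend on gawk. A minimal Python sketch of the same approach follows, assuming balanced, non-nested <phrase> tags; the name `underscore_by_splitting` is made up for the example. The line is split on the tags, and only the chunks that fall between an opening and a closing tag are changed.

```python
import re


def underscore_by_splitting(line: str) -> str:
    out = []
    inside = False
    # The capturing group keeps the tags themselves in the result of split().
    for part in re.split(r"(</?phrase>)", line):
        if part == "<phrase>":
            inside = True
        elif part == "</phrase>":
            inside = False
        elif inside:
            # Like gsub(/\s+/, "_"): each run of whitespace becomes one underscore.
            part = re.sub(r"\s+", "_", part)
        out.append(part)
    return "".join(out)


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_by_splitting(line))
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```

As with the awk command, a run of several spaces inside the tags collapses to a single underscore, which differs from the answers that replace each space individually.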

Answer 6

If your data is in a file named 'd', with GNU sed:

sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d
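The loop in that command replaces one whitespace character per pass and jumps back to the label until the substitution stops matching. A Python sketch of the same loop, with the hypothetical name `underscore_by_looping`, might look like this:

```python
import re

# <tag>, text without "<", one whitespace character, more text, then </tag.
# The backreference \1 requires the closing tag name to match the opening one.
ONE_SPACE_IN_ELEMENT = re.compile(r"<(\w+)>([^<]*)\s([^<]*)(</\1)")


def underscore_by_looping(line: str) -> str:
    while True:
        # count=1 mirrors sed's s/// without the g flag: one replacement per pass.
        line, count = ONE_SPACE_IN_ELEMENT.subn(r"<\1>\2_\3\4", line, count=1)
        if count == 0:  # same role as "tb" no longer branching
            return line


line = "some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"
print(underscore_by_looping(line))
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```

Since the pattern uses (\w+) and a backreference, it applies to any element whose content has no nested tags, not just <phrase>.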


6 answers

solution1
5 2019-02-09 22:58:28

solution2
2 ACCEPTED 2019-02-09 22:42:08

solution3
2 2019-02-10 00:56:48

solution4
1 2019-02-09 23:28:17

solution5
1 2019-02-10 04:51:56

solution6
1 2019-04-09 10:18:17
