
sed replace whitespace with underscore between 2 strings

I have a file that contains lines like this

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>

I need to replace all the spaces between <phrase> tags with an underscore. So basically I need to replace every space that falls between > and </ with an underscore. I've tried many different commands in sed, awk, and perl but haven't been able to get anything to work. Below are some of the commands I've tried.

sed 's@>\\s+[</]@_@g'

perl -pe 'sub c{$s=shift;$s=~s/ /_/g;$s}s/>.*?[<\\/]/c$&/ge'

sed 's@\\(\\[>^[<\\/]]*\\)\\s+@\\1_@g'

awk -v RS='\\\\[>^[<\\]/]*\\\\]' '{ gsub(/\\<(\\s+)\\>/, "_", RT); printf "%s%s", $0, RT }' infile

I've been looking at these 2 questions and trying to adapt their answers to the characters I need:
sed substitute whitespace for dash only between specific character patterns

https://unix.stackexchange.com/questions/63335/how-to-remove-all-white-spaces-just-between-brackets-using-unix-tools

Can anyone please help?

Don't use regular expressions to parse XML/HTML.

use warnings;
use 5.014;  # for /r modifier
use Mojo::DOM;

my $text = <<'ENDTEXT';
some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
ENDTEXT

my $dom = Mojo::DOM->new($text);
$dom->find('phrase')->each(sub { $_->content( $_->content=~tr/ /_/r ) });
print $dom;

Output:

some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

Update: Mojolicious even includes some sugar (the ojo module) that lets you squeeze that code into a one-liner:

$ perl -Mojo -pe '($_=x($_))->find("phrase")->each(sub{$_->content($_->content=~tr/ /_/r)})' input.txt

"I need to replace every space that falls between > and </ with an underscore."

That won't actually do what you want, because, for example, in

some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>
                  ^^^^^^^^^^^      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

the substrings "between > and </ " cover more than you think (marked ^ above).

I think the most straightforward way to express your requirements in Perl is

perl -pe 's{>[^<>]*</}{ $& =~ tr/ /_/r }eg'

Here [^<>] is used to make sure that the matched substring cannot contain < or > (in particular, it cannot match other <phrase> tags).

If that's too readable, you can also do

perl '-pes;>[^<>]*</;$&=~y> >_>r;eg'

This might work for you (GNU sed):

sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g' file

Delimit tags by inserting newlines. Iteratively substitute spaces between pairs of newlines with underscores. When there are no more matches, remove the introduced newlines.
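The stages described above can be run separately to see what each one does. This is only a sketch and assumes GNU sed (for \n in the replacement and in bracket expressions); the variable names line, marked and result are made up for the example.

```shell
# Sample line from the question.
line='some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>'

# Stage 1 only: insert a newline before every opening and closing tag.
# The single input line becomes five lines, alternating outside/inside text.
marked=$(printf '%s\n' "$line" | sed -E 's/<phrase>|<\/phrase>/\n&/g')
printf '%s\n' "$marked"

# Full command: mark the tags, loop replacing one space at a time inside
# the marked regions, then delete the inserted newlines again.
result=$(printf '%s\n' "$line" |
  sed -E 's/<phrase>|<\/phrase>/\n&/g;ta;:a;s/^([^\n]*(\n[^\n ]*\n[^\n]*)*\n[^\n]*) /\1_/;ta;s/\n//g')
printf '%s\n' "$result"
# last line printed:
# some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>
```

Each pass of the loop replaces one space, so the number of iterations grows with the number of spaces inside the tags.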

Another Perl solution, replacing spaces only between the <phrase> tags:

$ export a="some thing <phrase>a phrase</phrase> some thing else <phrase>other stuff</phrase>"

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;sprintf("%s",$x)/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>


EDIT

Thanks @haukex, shortening further

$ echo "$a" | perl -lne ' s/(?<=<phrase>)(.+?)(?=<\/phrase>)/$x=$1;$x=~s{ }{_}g;$x/ge ;  print '
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>


With GNU awk for multi-char RS and RT:

$ awk -v RS='</?phrase>' '!(NR%2){gsub(/\s+/,"_")} {ORS=RT}1' file
some thing <phrase>a_phrase</phrase> some thing else <phrase>other_stuff</phrase>

If your data is in the file 'd', with GNU sed:

sed -E ':b s/<(\w+)>([^<]*)\s([^<]*)(<\/\1)/<\1>\2_\3\4/;tb' d
