Regex to find(/replace) multiple instances of character in string

I have a (probably very basic) question about how to construct a Perl regex (perl -pe 's///g;') that would find/replace multiple instances of a given character/set of characters in a specified string. Initially, I thought the g "global" flag would do this, but I'm clearly misunderstanding something very central here. :/

For example, I want to eliminate any non-alphanumeric characters in a specific string (within a larger text corpus). Just by way of example, the string is identified by starting with [ followed by @, possibly with some characters in between.

[abc@def"ghi"jkl'123]

The following regex

s/(\[[^\[\]]*?@[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g;

will find the first " and if I run it three times I have all three. Similarly, what if I want to replace the non-alphanumeric characters with something else, let's say an X.

s/(\[[^\[\]]*?@[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1X$2/g; 

does the trick for one instance. But how can I find all of them in one go?

The reason your code doesn't work is that /g doesn't rescan the string after a substitution. It finds all non-overlapping matches of the given regex and then substitutes the replacement part in.

In [abc@def"ghi"jkl'123] , there is only a single match (which is the [abc@def" part of the string, with $1 = '[abc@def' and $2 = '' ), so only the first " is removed.

After the first match, Perl scans the remaining string ( ghi"jkl'123] ) for another match, but it doesn't find another [ (or @ ).
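To see this for yourself, here is a minimal sketch that runs the substitution from the question once with /g and reports how many substitutions were made. The variable names $str and $count are just for illustration.

```perl
use strict;
use warnings;

# The sample string from the question.
my $str = q([abc@def"ghi"jkl'123]);

# The question's substitution, unchanged. In scalar context, s///g returns
# the number of substitutions it made.
my $count = ($str =~ s/(\[[^\[\]]*?@[^\[\]]*?)[^a-zA-Z0-9]+?([^\[\]]*?)/$1$2/g);

print "substitutions: $count\n";   # prints "substitutions: 1"
print "result: $str\n";            # only the first " has been removed
```

Even with /g, the count is 1, because the pattern as a whole can only match once in this string.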


I think the most straightforward solution is to use a nested search/replace operation. The outer match identifies the string within which to substitute, and the inner match does the actual replacement.

In code:

s{ \[ [^\[\]\@]* \@ \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9//cdr }xe;

Or to replace each match by X :

s{ \[ [^\[\]\@]* \@ \K ([^\[\]]*) (?= \] ) }{ $1 =~ tr/a-zA-Z0-9/X/cr }xe;

We match a prefix of [ , followed by 0 or more characters that are not [ or ] or @ , followed by @ .

\K is used to mark the virtual beginning of the match (i.e. everything matched so far is not included in the matched string, which simplifies the substitution).

We match and capture 0 or more characters that are not [ or ] .

Finally we match a suffix of ] in a look-ahead (so it's not part of the matched string either).

The replacement part is executed as a piece of code, not a string (as indicated by the /e flag). Here we could have used $1 =~ s/[^a-zA-Z0-9]//gr or $1 =~ s/[^a-zA-Z0-9]/X/gr , respectively, but since each inner match is just a single character, it's also possible to use a transliteration.

We return the modified string (as indicated by the /r flag) and use it as the replacement in the outer s operation.
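For reference, the two substitutions above could be wrapped up as a small script along these lines. This is only a sketch: the sub names strip_bracketed and mask_bracketed are made up for the example, and the /g flag on the outer s is an addition (an assumption that a line may contain more than one bracketed string). It needs Perl 5.14 or later for /r.

```perl
use strict;
use warnings;

# Hypothetical helper: delete non-alphanumerics after the @ in [...@...].
sub strip_bracketed {
    my ($text) = @_;
    # /g is added here so that every bracketed string in $text is handled.
    $text =~ s{ \[ [^\[\]\@]* \@ \K ([^\[\]]*) (?= \] ) }
              { $1 =~ tr/a-zA-Z0-9//cdr }xeg;
    return $text;
}

# Hypothetical helper: same outer match, but replace each such character by X.
sub mask_bracketed {
    my ($text) = @_;
    $text =~ s{ \[ [^\[\]\@]* \@ \K ([^\[\]]*) (?= \] ) }
              { $1 =~ tr/a-zA-Z0-9/X/cr }xeg;
    return $text;
}

my $sample = q(keep "this" [abc@def"ghi"jkl'123] and [x@y-z!]);
print strip_bracketed($sample), "\n";
print mask_bracketed($sample), "\n";
```

Text outside the brackets (like "this") is left alone, because the inner tr only ever sees what the outer match captured in $1.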

So...I'm going to suggest a marvelously computationally inefficient approach to this. Marvelously inefficient, but possibly still faster than a variable-length lookbehind would be...and also easy (for you):

The \K causes everything matched before it to be dropped from the match, so only the character after it is actually replaced.

perl -pe 'while (s/\[[^]]*@[^]]*\K[^]a-zA-Z0-9]//){}' file

Basically we just have an empty loop that executes until the search and replace replaces nothing.

Slightly improved version:

perl -pe 'while (s/\[[^]]*?@[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//){}' file

The (?=) verifies that its content exists after the match without being part of the match. This is a variable-length lookahead (what we're missing going the other direction). I also made the * quantifiers lazy with ? so we get the shortest match possible.
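If you'd rather see the loop in a script than in a one-liner, it might look roughly like this. The sub name strip_by_looping and the pass counter are only there for illustration, and I've written \@ instead of a bare @ just to be explicit that no array is meant.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one offending character per pass until
# the substitution no longer finds anything to remove.
sub strip_by_looping {
    my ($line) = @_;
    my $passes = 0;
    # s/// returns true only when it replaced something, so the loop ends
    # on the first pass that changes nothing.
    $passes++ while $line =~ s/\[[^]]*?\@[^]]*?\K[^]a-zA-Z0-9](?=[^]]*?])//;
    return ($line, $passes);
}

my ($clean, $passes) = strip_by_looping(q([abc@def"ghi"jkl'123]));
print "$clean after $passes successful passes\n";
```

The string is rescanned from the start on every pass, which is where the inefficiency comes from: one full scan per removed character, plus one final scan that finds nothing.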

Here is another approach. Capture precisely the substring that needs work, and in the replacement part run a regex on it that cleans it of non-alphanumeric characters:

use warnings;
use strict;
use feature 'say';

my $var = q(ah [abc@def"ghi"jkl'123] oh); #'
say $var;

$var =~ s{ \[ [^\[\]]*? \@\K ([^\]]+) }{
    (my $v = $1) =~ s{[^0-9a-zA-Z]}{}g;
    $v
}ex;

say $var;

Here the lone $v is needed so that the replacement code returns the modified string, and not the number of substitutions, which is what the s/// operator itself returns. This can be improved by using the /r modifier (available since Perl 5.14), which returns the changed string and doesn't change the original (so it doesn't attempt to change $1, which isn't allowed):

$var =~ s{ \[ [^\[\]]*? \@\K ([^\]]+) }{
    $1 =~ s/[^0-9a-zA-Z]//gr;
}ex;

The \K is there so that everything matched before it is "dropped" from the match: that text still has to be present, but it is not replaced, so we don't need to capture it in order to put it back. The /e modifier makes the replacement part be evaluated as code.

The code in the question doesn't work because everything matched is consumed, and (under /g ) the search continues from the position after the last match, attempting to find that whole pattern again further down the string. That fails and only that first occurrence is replaced.

The problem of matched text that we want to leave in the string can often be remedied by \K (used in all current answers), which excludes everything matched before it from the text that gets replaced.
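A small sketch of what \K does and doesn't do, using made-up strings ('key=value' and 'aaaa' are not from the question). The first pair shows \K standing in for a capture that is put back. The second pair shows that under /g the text before \K is still stepped over when the search resumes, unlike a lookbehind.

```perl
use strict;
use warnings;

# Capturing the prefix and putting it back...
my $with_capture = 'key=value' =~ s/(key=)\w+/$1X/r;
# ...gives the same result as keeping it out of the match with \K.
my $with_keep    = 'key=value' =~ s/key=\K\w+/X/r;

# Under /g the search resumes after the whole matched text, including
# the part before \K, so the characters are handled in pairs.
my $keep_g       = 'aaaa' =~ s/a\Ka/X/gr;
# A lookbehind only inspects the preceding character, so every 'a'
# that follows another 'a' is replaced.
my $lookbehind_g = 'aaaa' =~ s/(?<=a)a/X/gr;

print "$with_capture $with_keep\n";   # prints "key=X key=X"
print "$keep_g $lookbehind_g\n";      # prints "aXaX aXXX"
```

That second behaviour is the same reason the pattern in the question only fires once: whatever the pattern has matched, whether it is replaced or not, is behind the position where the next search starts.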
