Removing all occurrences of the specified substring, even overlapping ones

For example, the source string is "appleappleapplebanana" and the pattern I want to delete is "appleapple".

I want it to delete all "appleapple" even if they overlap, so that only "banana" is left.

appleappleapplebanana
^^^^^^^^^^              <-first  occurrence
     ^^^^^^^^^^         <-second occurrence     

If I use replaceAll, the result is "applebanana" since after deleting the first one, the remaining part is just "applebanana".
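A minimal sketch reproducing that behaviour in Java. The class and method names (NaiveRemoval, naiveRemove) are made up for illustration; String.replace is used so the pattern is treated literally.

```java
class NaiveRemoval {
    // Non-overlapping removal: String.replace resumes scanning right after
    // each match, so an occurrence that starts inside a removed match is missed.
    static String naiveRemove(String source, String pattern) {
        return source.replace(pattern, "");
    }

    public static void main(String[] args) {
        // The occurrence starting at index 5 is never seen.
        System.out.println(naiveRemove("appleappleapplebanana", "appleapple")); // applebanana
    }
}
```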

Expected results:

Input                    Pattern       Result
"appleapplebanana"       "appleapple"  "banana"
"appleappleapplebanana"  "appleapple"  "banana"
"applebanana"            "appleapple"  "applebanana"
"aaabbbaaabbbaaa"        "aaabbbaaa"   "" (empty string)

I need to process arbitrary input patterns, so just using replace("apple") wouldn't work.

Though I have an idea for this:

  1. Get all occurrences (using something like KMP)
  2. Mark corresponding characters as "to-be deleted"
  3. Delete marked characters
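The three steps above can be sketched in Java roughly as follows. This is a minimal sketch, not a library method: the names OverlapRemover and removeAllOverlapping are made up for illustration, and it uses String.indexOf in place of KMP to find occurrences, which is simpler but not guaranteed linear-time.

```java
class OverlapRemover {
    // Removes every character covered by at least one occurrence of pattern,
    // whether the occurrences overlap or not.
    static String removeAllOverlapping(String source, String pattern) {
        if (pattern.isEmpty()) {
            return source; // nothing to remove; also keeps the loop below finite
        }
        // Steps 1 and 2: find each occurrence and mark the characters it covers.
        boolean[] marked = new boolean[source.length()];
        int coveredUntil = 0; // everything before this index is already handled
        int from = 0;
        int hit;
        while ((hit = source.indexOf(pattern, from)) >= 0) {
            int end = hit + pattern.length();
            for (int i = Math.max(hit, coveredUntil); i < end; i++) {
                marked[i] = true;
            }
            coveredUntil = end;
            from = hit + 1; // advance by one so overlapping occurrences are found too
        }
        // Step 3: copy only the unmarked characters.
        StringBuilder kept = new StringBuilder();
        for (int i = 0; i < source.length(); i++) {
            if (!marked[i]) {
                kept.append(source.charAt(i));
            }
        }
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple")); // banana
    }
}
```

Swapping indexOf for a KMP search would only change how the occurrences are found; the marking and deleting steps stay the same.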

However, I would like to know if there is a better (fancier, ready-made) way to achieve this.


I ended up making my own function using the idea above, since no common library or package seems to support this feature.

The question was a bit confusing at first. After the updates, I think the best example provided to illustrate the problem is matching the "pattern" aaabbbaaa in aaabbbaaabbbaaa.

aaabbbaaabbbaaa
aaabbbaaa
      aaabbbaaa
      ^-^        < overlapping part
^-------------^  < match this part: 'aaa' is overlapping

If the length of the "pattern" string may be used in the regex, a lookbehind could be used:

.{1,9}(?<=aaabbbaaa)

This regex (demo) matches from one character up to the pattern's length in characters, as long as the match ends right after aaabbbaaa. So it will match aaabbbaaa but also the following bbbaaa, because the last a is also preceded by aaabbbaaa, and due to the length restriction the match cannot reach into text outside an occurrence. It will also match non-overlapping occurrences in aaabbbaaaaaabbbaaa but leave e.g. ccc in aaabbbaaacccaaabbbaaa.

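A Java demo (at tio.run) that builds the regex from the pattern and its length:

import java.util.regex.Pattern;
// Pattern.quote escapes regex metacharacters in pat; DOTALL lets '.' match line breaks.
String regex = ".{1," + pat.length() + "}(?<=" + Pattern.quote(pat) + ")";
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
String result = p.matcher(str).replaceAll("");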

For longer inputs it can be more efficient to add a lookahead to start the match and wrap the lookbehind part in a repeated group with at least one repetition:

(?=aaabbbaaa)(?:.{1,9}(?<=aaabbbaaa))+

This can almost double the performance (demo) but is less efficient on shorter strings than the version without the lookahead. Furthermore, if the pattern consists only of word characters and the input also contains non-word characters, you can use \w (word character) instead of the dot.
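As a rough sketch, the lookahead variant can be built for an arbitrary pattern in the same way. The names LookaheadRemoval and removeWithLookahead are hypothetical; Pattern.quote and Pattern.DOTALL are additions so that metacharacters and line breaks in the input are handled. The sketch assumes the pattern contains no supplementary (surrogate-pair) characters, since String.length() counts UTF-16 code units.

```java
import java.util.regex.Pattern;

class LookaheadRemoval {
    static String removeWithLookahead(String source, String pattern) {
        if (pattern.isEmpty()) {
            return source; // ".{1,0}" would not be a valid quantifier
        }
        String quoted = Pattern.quote(pattern); // treat the pattern as a literal
        // (?=pat)          start only where an occurrence begins
        // (?:.{1,n}(?<=pat))+  consume chunks that each end at the end of an occurrence
        String regex = "(?=" + quoted + ")(?:.{1," + pattern.length() + "}(?<=" + quoted + "))+";
        return Pattern.compile(regex, Pattern.DOTALL).matcher(source).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeWithLookahead("appleappleapplebanana", "appleapple")); // banana
    }
}
```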

Technically, this is overlapping.

appleapple
     appleappleappleapple
                    appleapple

And, this is repeating.

appleapple
     appleapple
          appleapple

You could also describe the latter as having overlapped, but overlap is not really a distinguishing property of a repeating pattern.
At that point the overlap is inherent in the repetition, so the term is redundant; it is just a description.

In addition to String#replace there is also String#replaceAll.
It takes a regular expression pattern as the first argument.

You could use the following pattern to remove repeating values that have overlapped.

(apple)\1+
replaceAll("(apple)\\1+", "")
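A small sketch of that call in use. The class and method names (RepeatRemoval, removeRepeats) are made up for illustration, and the repeating unit "apple" is hard-coded, so this covers only the repeating case from the question's examples, not arbitrary patterns.

```java
class RepeatRemoval {
    // (apple)\1+ matches "apple" followed by one or more further copies of itself,
    // i.e. two or more consecutive "apple"s. A single "apple" is left alone.
    static String removeRepeats(String source) {
        return source.replaceAll("(apple)\\1+", "");
    }

    public static void main(String[] args) {
        System.out.println(removeRepeats("appleappleapplebanana")); // banana
        System.out.println(removeRepeats("applebanana"));           // applebanana
    }
}
```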

I'm not sure if there is a way to remove overlapping values using a single pattern.
I imagine it would be much more complex.

You mentioned "... mark corresponding characters as 'to-be deleted'".
This would most likely be the logical way to remove truly overlapping values.
