Removing all occurrences of the specified substring, even overlapping ones

Question

For example, the source string is "appleappleapplebanana" and the pattern I want to delete is "appleapple".

I want it to delete all "appleapple" even if they overlap, so that only "banana" is left.

appleappleapplebanana
^^^^^^^^^^              <-first  occurrence
     ^^^^^^^^^^         <-second occurrence

If I use replaceAll, the result is "applebanana" since after deleting the first one, the remaining part is just "applebanana".

Expected results:

Input	Pattern	Result
"appleapplebanana"	"appleapple"	"banana"
"appleapplebanana"	"appleapple"	"banana"
"appleappleapplebanana"	"appleapple"	"banana"
"applebanana"	"appleapple"	"applebanana"
"aaabbbaaabbbaaa"	"aaabbbaaa"	""(empty string)

I need to process arbitrary input patterns, so just using replace("apple") wouldn't work.

Though I have an idea for this:

1. Get all occurrences (using something like KMP)
2. Mark the corresponding characters as "to be deleted"
3. Delete the marked characters

However, I would like to know if there is a better (~~fancier~~ ready-made) way to achieve this.

I ended up writing my own function using the idea above, since no common library or package seems to support this feature.
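The mark-and-delete idea above can be sketched roughly as follows. This is not the asker's actual function, only a minimal illustration: the class and method names are hypothetical, `String.indexOf` stands in for KMP, and an empty pattern is assumed to remove nothing.

```java
import java.util.Arrays;

class OverlapRemover {
    // Hypothetical helper: deletes every character covered by at least one
    // occurrence of pat in str, including occurrences that overlap.
    static String removeAllOverlapping(String str, String pat) {
        if (pat.isEmpty()) {
            return str; // assumption: an empty pattern removes nothing
        }
        int n = str.length();
        int m = pat.length();
        boolean[] marked = new boolean[n];
        int markedUntil = 0; // exclusive end of the region marked so far

        // Step 1: find every occurrence, advancing by one so overlaps are seen.
        // indexOf is a naive search; KMP could replace it for large inputs.
        for (int idx = str.indexOf(pat); idx >= 0; idx = str.indexOf(pat, idx + 1)) {
            // Step 2: mark the occurrence, skipping the part already marked.
            Arrays.fill(marked, Math.max(idx, markedUntil), idx + m, true);
            markedUntil = idx + m;
        }

        // Step 3: keep only the unmarked characters.
        StringBuilder out = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            if (!marked[i]) {
                out.append(str.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple")); // banana
        System.out.println(removeAllOverlapping("applebanana", "appleapple"));           // applebanana
    }
}
```

Searching again from `idx + 1` rather than from the end of the previous match is what lets the second, overlapping occurrence be found.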

Answer 1

The question was a bit confusing at first. After the updates, I think the best example to illustrate the problem is matching the pattern aaabbbaaa in aaabbbaaabbbaaa.

aaabbbaaabbbaaa
aaabbbaaa
      aaabbbaaa
      ^-^        < overlapping part
^-------------^  < match this part: 'aaa' is overlapping

If the length of the pattern string may be used in the regex, a lookbehind can be used:

.{1,9}(?<=aaabbbaaa)

This regex matches between one character and as many characters as the pattern is long, as long as aaabbbaaa sits directly behind the end of the match. So it matches aaabbbaaa, but also the following bbbaaa, because the last a is again preceded by aaabbbaaa. Due to the length restriction, a match can never extend beyond the characters of an occurrence. It also matches non-overlapping occurrences, as in aaabbbaaaaaabbbaaa, but leaves e.g. ccc in aaabbbaaacccaaabbbaaa untouched.

A Java version that incorporates the pattern length (demo at tio.run):

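import java.util.regex.Pattern;
// Quote the pattern so that regex metacharacters in it are matched literally
String quoted = Pattern.quote(pat);
String regex = ".{1," + pat.length() + "}(?<=" + quoted + ")";
Pattern p = Pattern.compile(regex);
String result = p.matcher(str).replaceAll("");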

For longer inputs it can be more efficient to add a lookahead at the start of the match and to wrap the lookbehind part in a group that repeats at least once:

(?=aaabbbaaa)(?:.{1,9}(?<=aaabbbaaa))+

This can almost double the performance, but it is less efficient on shorter strings than the version without the lookahead. Further, if the pattern consists only of word characters, you can use \w (word character) instead of the dot when the input contains non-word characters.
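For an arbitrary pattern, the lookahead variant could be assembled along these lines. This is only a sketch: the class and method names are made up for illustration, `Pattern.quote` and `Pattern.DOTALL` are additions that the answer itself does not use (they let the pattern contain regex metacharacters and the input contain line breaks), and no performance claim is attached to it.

```java
import java.util.regex.Pattern;

class LookaroundRemover {
    // Hypothetical helper: builds (?=pat)(?:.{1,len}(?<=pat))+ for a literal pat
    // and deletes everything it matches.
    static String removeAllOverlapping(String str, String pat) {
        if (pat.isEmpty()) {
            return str; // assumption: an empty pattern removes nothing
        }
        String quoted = Pattern.quote(pat); // treat pat as a literal string
        String regex = "(?=" + quoted + ")"                       // start only where pat begins
                + "(?:.{1," + pat.length() + "}(?<=" + quoted + "))+"; // chunks that each end where pat ends
        // DOTALL lets the dot also match line terminators.
        return Pattern.compile(regex, Pattern.DOTALL).matcher(str).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(removeAllOverlapping("aaabbbaaacccaaabbbaaa", "aaabbbaaa")); // ccc
        System.out.println(removeAllOverlapping("appleappleapplebanana", "appleapple")); // banana
    }
}
```

Each repetition of the group consumes at most one pattern length and must end exactly where an occurrence ends, so the match can grow across overlapping occurrences without swallowing unrelated text.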

Answer 2

Technically, this is over-lapping.

appleapple
     appleappleappleapple
                    appleapple

And, this is repeating.

appleapple
     appleapple
          appleapple

You could also describe the latter as having over-lapped.
Over-lapping, however, is not an intrinsic property of a pattern that simply repeats.
For a repeating pattern the term is redundant; it is just a description.

In addition to String#replace there is also String#replaceAll .
It uses a regular expression pattern as the first argument.

You could use the following pattern to replace repeating values that have over-lapped.

(apple)\1+

replaceAll("(apple)\\1+", "")
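To see what the backreference pattern does and does not cover, here is a small sketch. The helper name `removeRepeats` and its `unit` parameter are hypothetical, and `Pattern.quote` is added so that the unit is treated as a literal string.

```java
import java.util.regex.Pattern;

class RepeatRemover {
    // Hypothetical helper: removes runs of two or more back-to-back copies of unit,
    // i.e. the generalised form of (apple)\1+.
    static String removeRepeats(String str, String unit) {
        if (unit.isEmpty()) {
            return str; // assumption: an empty unit removes nothing
        }
        String regex = "(" + Pattern.quote(unit) + ")\\1+";
        return str.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Three back-to-back copies of "apple" are removed.
        System.out.println(removeRepeats("appleappleapplebanana", "apple"));      // banana
        // With the asker's pattern as the unit, two full copies never occur in a row.
        System.out.println(removeRepeats("appleappleapplebanana", "appleapple")); // appleappleapplebanana
        // Truly over-lapping occurrences are not back-to-back repeats either.
        System.out.println(removeRepeats("aaabbbaaabbbaaa", "aaabbbaaa"));        // aaabbbaaabbbaaa
    }
}
```

The approach therefore only works when the repeating unit is known in advance, not for an arbitrary pattern whose occurrences share characters.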

I'm not sure if there is a way to remove over-lapping values using a single pattern.
I imagine it would be much more complex.

You mentioned "... mark corresponding characters as 'to-be deleted'".
This would most likely be the logical way to remove truly over-lapping values.
