This is an extension of the question linked here.
In the linked question, the answer used " ?" (a space followed by ?) as a regex pattern to match a string with or without a space in it.
The Problem Statement:
I have a string and an array of phrases.
input_string = 'alice is a character from a fairy tale that lived in a wonder land. A character about whome no-one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
What I want to do is remove the last occurrence of each word in the array phrases_to_remove from input_string.
output_string = 'alice is a character from a fairy tale that lived in a. A about whome knows much about'
Note: the words to remove may or may not occur in the string. If they do, they may occur either in the same form ('wonderland', 'character', 'noone') or with a space or a hyphen (-) between their parts, e.g. wonder land, no-one.
The issue with my code is that I can't remove the words that differ only by a space or a hyphen, for example wonder land, wonderland and wonder-land.
I tried (-)?|( )? as a regex but couldn't get it to work. Any help is appreciated.
Since you don't know where the separators can be, you could generate a regex made of ORed sub-regexes (using word boundaries to avoid matching sub-words).
Each sub-regex alternates the letters of the word with [\s\-]* (matching zero or more spaces or dashes), built by calling str.join on the characters of the word:
import re
input_string = 'alice is a character from a fairy tale that lived in a wonder - land. A character about whome no one knows much about'
phrases_to_remove = ['wonderland', 'character', 'noone']
the_regex = "|".join(r"\b{}\b".format(r'[\s\-]*'.join(x)) for x in phrases_to_remove)
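As a minimal sketch, the construction above can be wrapped in a helper and checked against the spelling variants from the question. The name build_pattern and its sep parameter are hypothetical, not part of the original answer, and re.escape is added on the assumption that a phrase might contain regex metacharacters.

```python
import re

def build_pattern(phrases, sep=r"[\s\-]*"):
    """Join each phrase's characters with `sep` and OR the results together.

    `build_pattern` and `sep` are illustrative names, not from the answer.
    """
    alternatives = (
        r"\b{}\b".format(sep.join(re.escape(ch) for ch in phrase))
        for phrase in phrases
    )
    return re.compile("|".join(alternatives))

pattern = build_pattern(["wonderland", "character", "noone"])

# Every spelling variant should be accepted as a whole-string match.
for candidate in ["wonderland", "wonder land", "wonder-land", "wonder - land", "no-one"]:
    print(repr(candidate), bool(pattern.fullmatch(candidate)))
```

The word boundaries mean that a longer word such as "characters" is not matched.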
Now to handle the "replace everything but the first occurrence" part: let's define an object that keeps the first match and replaces every later match with an empty string (using an internal counter):
class Replacer:
    def __init__(self):
        self.__counter = 0

    def replace(self, m):
        # Keep the first match unchanged; blank out every later one.
        if self.__counter:
            return ""
        else:
            self.__counter += 1
            return m.group(0)
Now pass the replace method to re.sub:
print(re.sub(the_regex,Replacer().replace,input_string))
result:
alice is a character from a fairy tale that lived in a . A about whome knows much about
(The generated regex is pretty complex, by the way: \bw[\s\-]*o[\s\-]*n[\s\-]*d[\s\-]*e[\s\-]*r[\s\-]*l[\s\-]*a[\s\-]*n[\s\-]*d\b|\bc[\s\-]*h[\s\-]*a[\s\-]*r[\s\-]*a[\s\-]*c[\s\-]*t[\s\-]*e[\s\-]*r\b|\bn[\s\-]*o[\s\-]*o[\s\-]*n[\s\-]*e\b)
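Note that the Replacer above uses a single counter for all phrases, so it keeps only the first match overall, which happens to give the expected result for this input. If the goal is strictly "remove the last occurrence of each phrase", one possible sketch is to handle the phrases one at a time. The function name remove_last_occurrences is hypothetical, and the shortened sample text is made up for illustration.

```python
import re

def remove_last_occurrences(text, phrases, sep=r"[\s\-]*"):
    """Remove only the last match of each phrase.

    Spaces and hyphens are tolerated between the letters of a phrase.
    `remove_last_occurrences` is an illustrative name, not from the answer.
    """
    for phrase in phrases:
        pattern = re.compile(r"\b{}\b".format(sep.join(map(re.escape, phrase))))
        matches = list(pattern.finditer(text))
        if matches:
            last = matches[-1]
            # Cut the last match out of the text; earlier matches stay.
            text = text[:last.start()] + text[last.end():]
    return text

sample = "a character in a wonder land. A character no-one knows"
result = remove_last_occurrences(sample, ["wonderland", "character", "noone"])
print(result)
```

The removed phrases leave their surrounding spaces behind, so the result may contain doubled spaces; collapsing them (for example with " ".join(result.split())) is a separate step.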
The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.
Consider what happens when the list of words is a, b:
>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'
You'd like this regex to match ab, a-b or a b, but clearly it does not do that. Because | splits the whole pattern into two alternatives, it matches a, a-, b or <space>b instead:
>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>
To fix this, you can enclose the separator in its own group: ([- ])?
If you also want to match words like wonder - land (i.e. where there are spaces before or after the hyphen), you should use the following instead: (\s*-?\s*)?
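A short sketch of this fix, assuming the parts of the word are already known (here "wonder" and "land"). The helper name join_with_optional_separator is hypothetical; the separator is the one suggested above.

```python
import re

def join_with_optional_separator(parts, sep=r"(\s*-?\s*)?"):
    """Join word parts with an optional separator enclosed in its own group.

    `join_with_optional_separator` is an illustrative name, not from the answer.
    """
    return sep.join(re.escape(part) for part in parts)

regex = join_with_optional_separator(["wonder", "land"])
print(regex)

# The separator is now optional as a unit, so all spellings match in full,
# while either part on its own does not.
for candidate in ["wonderland", "wonder land", "wonder-land", "wonder - land", "wonder"]:
    print(repr(candidate), bool(re.fullmatch(regex, candidate)))
```

Because the separator sits in its own group, the alternation problem shown earlier cannot occur: the pattern no longer splits into two independent alternatives.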
You can use one at a time.
For space: ^[ \t]+
For '-': @"[^0-9a-zA-Z]+