
Remove words with spaces or “-” in them Python

This is an extension to the question here

In the linked question, the answer used an optional space (space?) as the regex pattern, so that a phrase matches whether or not it contains a space.

The Problem Statement:

I have a string and an array of phrases.

input_string = 'alice is a character from a fairy tale that lived in a wonder land. A character about whome no-one knows much about'

phrases_to_remove = ['wonderland', 'character', 'noone']

What I want to do is remove the last occurrence of each word in phrases_to_remove from input_string.

output_string = 'alice is a character from a fairy tale that lived in a. A about whome knows much about'

Note: the words to remove may or may not occur in the string. If they do, they may occur in exactly the listed form ('wonderland', 'character', 'noone'), or with a space or a hyphen (-) inside the word, e.g. wonder land, no-one.

The issue with my code is that I can't remove words that differ only by a space or a hyphen. For example, wonder land, wonderland and wonder-land should all be treated as the same word.

I tried the (-)?|( )? as a regex but couldn't get it to work.

I need help

Since you don't know where the separations can be, you could generate a regex made of ORed regexes (using word boundaries to avoid matching sub-words).

Those regexes would alternate the letters of the word and [\s\-]* (matching zero or more occurrences of "space" or "dash"), using str.join on each character:

import re

input_string = 'alice is a character from a fairy tale that lived in a wonder - land. A character about whome no one knows much about'

phrases_to_remove = ['wonderland', 'character', 'noone']

the_regex = "|".join(r"\b{}\b".format(r'[\s\-]*'.join(x)) for x in phrases_to_remove)
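To see what the join produces for a single word, here is a minimal sketch. The helper name flexible_pattern is hypothetical, and the re.escape call is an addition that the line above does not have; it only matters if a phrase contains regex metacharacters.

```python
import re

def flexible_pattern(word):
    # Allow any run of whitespace or hyphens between consecutive letters.
    # re.escape is an extra precaution, not part of the original one-liner.
    separator = r"[\s\-]*"
    return r"\b{}\b".format(separator.join(re.escape(ch) for ch in word))

pattern = flexible_pattern("noone")
print(pattern)

# Each spelling variant should match in full
variants = ["noone", "no one", "no-one", "no - one"]
matches = [bool(re.fullmatch(pattern, v)) for v in variants]
print(matches)
```

The word boundaries keep the pattern from firing inside a longer word such as "nooner".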

Now to handle the "replace everything but the first occurrence" part: let's define an object which will replace everything but the first match (using an internal counter)

class Replacer:
    def __init__(self):
        self.__counter = 0

    def replace(self,m):
        if self.__counter:
            return ""
        else:
            self.__counter += 1
            return m.group(0)

Now pass the replace method to re.sub:

print(re.sub(the_regex,Replacer().replace,input_string))

result:

alice is a character from a fairy tale that lived in a . A  about whome  knows much about

(The generated regex is pretty complex, by the way: \bw[\s\-]*o[\s\-]*n[\s\-]*d[\s\-]*e[\s\-]*r[\s\-]*l[\s\-]*a[\s\-]*n[\s\-]*d\b|\bc[\s\-]*h[\s\-]*a[\s\-]*r[\s\-]*a[\s\-]*c[\s\-]*t[\s\-]*e[\s\-]*r\b|\bn[\s\-]*o[\s\-]*o[\s\-]*n[\s\-]*e\b)
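One caveat: the counter in Replacer is shared by all phrases, so only the very first match of any phrase is kept. That gives the expected result here because 'character' happens to be the first match in the string. If each phrase should be handled on its own, as the question's "remove the last occurrence" wording suggests, a per-phrase variant could look like the sketch below. The function names are hypothetical, and it assumes matches of different phrases never overlap.

```python
import re

def flexible_pattern(word):
    # Letters of the word separated by optional whitespace/hyphens
    return r"\b{}\b".format(r"[\s\-]*".join(re.escape(ch) for ch in word))

def remove_last_occurrences(text, phrases):
    # Record the span of the last match of each phrase (if it occurs at all)
    spans = []
    for phrase in phrases:
        found = list(re.finditer(flexible_pattern(phrase), text))
        if found:
            spans.append(found[-1].span())
    # Delete from right to left so the earlier offsets stay valid
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + text[end:]
    return text

input_string = ('alice is a character from a fairy tale that lived in a wonder - land. '
                'A character about whome no one knows much about')
result = remove_last_occurrences(input_string, ['wonderland', 'character', 'noone'])
print(result)
```

On this input it prints the same string as the Replacer version. As in that version, the whitespace surrounding a removed word is left in place.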

The problem with your regex is grouping. Using (-)?|( )? as a separator does not do what you think it does.

Consider what happens when the list of words is a,b :

>>> regex = "(-)?|( )?".join(["a", "b"])
>>> regex
'a(-)?|( )?b'

You'd like this regex to match ab, a b or a-b, but clearly it does not do that. Because | has the lowest precedence, the pattern is read as "a(-)?" OR "( )?b", so it matches a, a-, b or <space>b instead!

>>> re.match(regex, 'a')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, 'a-')
<_sre.SRE_Match object at 0x7f68c9f3b718>
>>> re.match(regex, 'b')
<_sre.SRE_Match object at 0x7f68c9f3b690>
>>> re.match(regex, ' b')
<_sre.SRE_Match object at 0x7f68c9f3b718>
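The same check can be written for Python 3 with re.fullmatch, which makes it explicit that the intended strings are rejected. This is only a sketch of the test above; the list of candidate strings is chosen for illustration.

```python
import re

regex = "(-)?|( )?".join(["a", "b"])  # builds 'a(-)?|( )?b'

candidates = ["ab", "a b", "a-b", "a", "a-", "b", " b"]
# fullmatch requires the whole candidate to match, not just a prefix
results = {c: bool(re.fullmatch(regex, c)) for c in candidates}

for candidate, matched in results.items():
    print(repr(candidate), matched)
```

The first three candidates are the ones the pattern was meant to accept, and they are the ones that fail.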

To fix this you can enclose the separator in its own group: ([- ])? .

If you also want to match words like wonder - land (i.e. where there are spaces before/after the hyphen), you should use the following: (\s*-?\s*)? .
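A short sketch comparing the two separators. It assumes the phrase has already been split into its parts; the split into "wonder" and "land" is hypothetical, since the question's list only contains 'wonderland'.

```python
import re

parts = ["wonder", "land"]  # hypothetical split of the phrase

simple = "([- ])?".join(parts)       # one optional space or hyphen
loose = r"(\s*-?\s*)?".join(parts)   # optional hyphen with optional spaces around it

variants = ["wonderland", "wonder land", "wonder-land", "wonder - land"]
simple_results = [bool(re.fullmatch(simple, v)) for v in variants]
loose_results = [bool(re.fullmatch(loose, v)) for v in variants]

print(simple_results)
print(loose_results)
```

The simple separator rejects only the last variant, because it allows a single separator character; the looser one accepts all four.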

You can use one at a time:

For space:

^[ \t]+

For '-':

@"[^0-9a-zA-Z]+
