
How to remove consecutive identical words from a string in Python

I have the following string, from which I need to remove consecutive repeated words.

mystring = "my friend's new new new new and old old cats are running running in the street"

My output should look as follows.

myoutput = "my friend's new and old cats are running in the street"

I am using the following Python code to do it.

 mylist = []
 for i, w in enumerate(mystring.split()):
     for n, l in enumerate(mystring.split()):
             if l != w and i == n-1:
                     mylist.append(w)
 mylist.append(mystring.split()[-1])
 myoutput = " ".join(mylist)

However, my code is O(n²) and really inefficient, as I have a huge dataset. Is there a more efficient way of doing this in Python?

I am happy to provide more details if needed.

Short regex magic:

import re

mystring = "my friend's new new new new and old old cats are running running in the street"
res = re.sub(r'\b(\w+\s*)\1{1,}', '\\1', mystring)
print(res)

regex pattern details:

  • \b - word boundary
  • (\w+\s*) - one or more word characters \w+ followed by any number of whitespace characters \s*, enclosed in a capturing group (...)
  • \1{1,} - a backreference to the 1st captured group, repeated one or more times {1,}

The output:

my friend's new and old cats are running in the street
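
One caveat worth knowing: because the pattern above is built on \w, it can also collapse a repeated syllable inside a single word ("haha" becomes "ha"), and it leaves alone a repeat at the very end of the string or in tokens that contain an apostrophe. A possible variant, offered as a sketch rather than a definitive fix, treats every whitespace-delimited token as a word. The name collapse_repeats is made up for this example.

```python
import re

# Hypothetical helper, not part of the answer above.
# (?<!\S) and (?!\S) force the match to start and end on token boundaries,
# (\S+) captures one token, and (?:\s+\1)+ consumes its immediate repeats.
_REPEATS = re.compile(r'(?<!\S)(\S+)(?:\s+\1)+(?!\S)')

def collapse_repeats(text):
    # Replace each run of identical tokens with a single copy of the token.
    return _REPEATS.sub(r'\1', text)

print(collapse_repeats("haha it's it's fine fine"))  # haha it's fine
```

Whitespace between tokens that are not repeats is left untouched, so this variant does not normalise spacing the way split() and join() do.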

Using itertools.groupby:

>>> import itertools
>>> ' '.join(k for k, _ in itertools.groupby(mystring.split()))
"my friend's new and old cats are running in the street"

  • mystring.split() splits mystring into words on whitespace.
  • itertools.groupby groups consecutive equal words; k is the word shared by each group.
  • Using a generator expression, we keep only the group key k and ignore the group itself.
  • We join the keys with a space.

The complexity is linear in the size of the input string.
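
To make the grouping step more concrete, here is a small sketch on a shortened input (the sample sentence is just an illustration). It shows the runs that groupby produces before the keys are joined.

```python
import itertools

words = "new new new and old old".split()

# Each group is a run of equal neighbours: k is the word, g iterates over the run.
runs = [(k, len(list(g))) for k, g in itertools.groupby(words)]
print(runs)  # [('new', 3), ('and', 1), ('old', 2)]

# Keeping only the keys drops the repeats.
deduplicated = ' '.join(k for k, _ in itertools.groupby(words))
print(deduplicated)  # new and old
```

Note that groupby only merges adjacent equal items, so a word that reappears later in the sentence is kept, which is what the question asks for.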

Try this:

mystring = "my friend's new new new new and old old cats are running running in the street"

words = mystring.split()

answer = [each_pair[0] for each_pair in zip(words, words[1:]) if each_pair[0] != each_pair[1]] + words[-1:]  # words[-1:] also works for an empty string

print(' '.join(answer))

Output:

my friend's new and old cats are running in the street

Here we iterate over tuples of consecutive words and add the first word of each tuple to answer only if the two words in the tuple differ. The last word never appears as the first element of a tuple, so we append it to answer at the end.
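
As a rough illustration of the pairing step, the sketch below prints the intermediate values for a shortened sentence. The variable names pairs and kept are only for this example.

```python
words = "new new and old old".split()

# zip stops at the shorter input, so every word except the last starts a pair.
pairs = list(zip(words, words[1:]))
print(pairs)  # [('new', 'new'), ('new', 'and'), ('and', 'old'), ('old', 'old')]

# Keep the first word of each pair whose two words differ.
kept = [first for first, second in pairs if first != second]
print(kept)  # ['new', 'and']

# The last word is never the first element of a pair, so add it separately.
answer = kept + words[-1:]
print(' '.join(answer))  # new and old
```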

And now for something different. This solution uses generators right up to the final reassembly of the result string, so that it stays as memory-efficient as possible when the original string is very large.

import re

def remove_duplicates_helper(s):
    words = (x.group(0) for x in re.finditer(r"[^\s]+", s))
    current = None
    for word in words:
        if word != current:
            yield word
            current = word

def remove_duplicates(s):
    return ' '.join(remove_duplicates_helper(s))

mystring = "my friend's new new new new and old old cats are running running in the street"
print(remove_duplicates(mystring))

Output:

my friend's new and old cats are running in the street
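
One way to see the lazy behaviour is to pull only the first few words from such a generator. The sketch below restates the helper under the made-up name iter_unique_runs and uses itertools.islice to stop early. The repeated sample text is just an illustration.

```python
import itertools
import re

def iter_unique_runs(s):
    # Yield each word of s, skipping any word equal to the one just before it.
    current = None
    for match in re.finditer(r"\S+", s):
        word = match.group(0)
        if word != current:
            yield word
            current = word

text = "new new and old old cats " * 3

# Only as much of the text as is needed for four words gets scanned.
first_four = list(itertools.islice(iter_unique_runs(text), 4))
print(first_four)  # ['new', 'and', 'old', 'cats']
```

Because re.finditer returns matches one at a time, no intermediate list of all words is built. Only the final ' '.join(...) materialises the full result.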

Please find below my code:

def strip2single(textarr):
    # Join the words with spaces, skipping any word equal to its predecessor.
    if len(textarr) == 0:
        return ""
    result = textarr[0]
    for i in range(1, len(textarr)):
        if textarr[i] != textarr[i-1]:
            result = result + ' ' + textarr[i]
    return result


mystring = "my friend's new new new new and old old cats are running running in the street"

y = strip2single(mystring.split())
print(y)
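
Since strings are immutable, building the result by repeated concatenation may take quadratic time in the worst case (CPython can sometimes optimise it, but that is an implementation detail). A possible variant, sketched here under the made-up name strip_to_single, collects the kept words in a list and joins them once at the end.

```python
def strip_to_single(words):
    # Hypothetical variant of the function above: same result, one final join.
    kept = []
    for word in words:
        # kept[-1] always equals the previous input word, so this skips repeats.
        if not kept or word != kept[-1]:
            kept.append(word)
    return ' '.join(kept)

mystring = "my friend's new new new new and old old cats are running running in the street"
print(strip_to_single(mystring.split()))
```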

An O(n) solution exists for this problem.

mystring = "my friend's new new new new and old old cats are running running in the street"

Split into words:

words = mystring.split()

Skip the current word if it is equal to the previous one:

myoutput = ' '.join([x for i, x in enumerate(words) if i == 0 or x != words[i-1]])

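The split and enumerate operations are repeated on every pass of the outer loop, and the inner loop is only there to find the next word. Splitting once and comparing each word with its successor directly removes the inner loop and makes the code efficient:

 words = mystring.split()
 mylist = []
 for i, w in enumerate(words[:-1]):
     # keep w only if the next word is different
     if words[i + 1] != w:
         mylist.append(w)
 mylist.extend(words[-1:])  # the last word is always kept
 myoutput = " ".join(mylist)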


 