Regex for repeating words in a string in Python

I have a good regexp for replacing repeating characters in a string. Now I also need to replace repeating words: three or more repetitions of a word should be replaced by two.

For example,

bye! bye! bye!

should become

bye! bye!

My code so far:

import re

def replaceThreeOrMoreCharachetrsWithTwoCharacters(string):
    # Pattern to look for three or more repetitions of any character, including newlines.
    pattern = re.compile(r"(.)\1{2,}", re.DOTALL)
    return pattern.sub(r"\1\1", string)

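Assuming that a "word" in your requirement means one or more non-whitespace characters surrounded by whitespace or by the start or end of the string, you can try this pattern: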

re.sub(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)', r'\1', s)
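To see what that pattern does, here is a minimal sketch that wraps it in a function and runs it on a few inputs. The pattern is the one quoted above; the names REPEATED_WORDS and collapse_repeated_words are made up for illustration.

```python
import re

# The pattern is the one from the answer; the names here are hypothetical.
REPEATED_WORDS = re.compile(r'(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)')

def collapse_repeated_words(s):
    # Group 1 holds the first two copies of the word (with their original
    # spacing), so substituting \1 drops the third and later copies.
    return REPEATED_WORDS.sub(r'\1', s)

print(collapse_repeated_words("bye! bye! bye!"))     # bye! bye!
print(collapse_repeated_words("hi hi hi hi there"))  # hi hi there
print(collapse_repeated_words("ab ab abc"))          # ab ab abc (unchanged)
```

The last call shows why the lookarounds matter: without (?!\S), the "ab" at the start of "abc" would count as a third repetition.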

You could also try the regex below:

(?<= |^)(\S+)(?: \1){2,}(?= |$)

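Sample code (this uses the third-party regex module, because the built-in re module does not accept the variable-width lookbehind (?<= |^)):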

>>> import regex
>>> s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
>>> m = regex.sub(r'(?<= |^)(\S+)(?: \1){2,}(?= |$)', r'\1 \1', s)
>>> m
"hi hi some words words which'll repeat repeat"

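If installing the regex module is not an option, a roughly equivalent pattern can be written for the standard re module by turning the lookarounds into negative ones. This is only a sketch under that assumption: (?<![^ ]) and (?![^ ]) are my substitutes for (?<= |^) and (?= |$), and they may differ in edge cases such as a trailing newline.

```python
import re

# Assumption: "not preceded/followed by a non-space character" stands in for
# the (?<= |^) and (?= |$) lookarounds of the pattern above.
pattern = re.compile(r'(?<![^ ])(\S+)(?: \1){2,}(?![^ ])')

s = "hi hi hi hi some words words words which'll repeat repeat repeat repeat repeat"
result = pattern.sub(r'\1 \1', s)
print(result)  # hi hi some words words which'll repeat repeat
```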

I know you were after a regular expression, but you could use a simple loop to achieve the same thing:

def max_repeats(s, max=2):
  last = ''
  out = []
  for word in s.split():
    same = 0 if word != last else same + 1
    if same < max: out.append(word)
    last = word
  return ' '.join(out)

As a bonus, I have allowed a different maximum number of repeats to be specified (the default is 2). Any run of whitespace between words is collapsed to a single space, because the string is split on whitespace and rejoined with single spaces. It's up to you whether you consider that to be a bug or a feature :)
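If the lost spacing matters, one possible variant keeps the whitespace that preceded each surviving word. This is a sketch, not part of the answer above: the name max_repeats_keep_spacing and the parameter limit are my own, and trailing whitespace at the end of the string is still dropped.

```python
import re

def max_repeats_keep_spacing(s, limit=2):
    out = []
    last = None
    run = 0
    # Each match is an optional run of whitespace followed by one word.
    for m in re.finditer(r'(\s*)(\S+)', s):
        gap, word = m.groups()
        # Count how many times in a row the current word has appeared.
        run = run + 1 if word == last else 1
        if run <= limit:
            # Keep the word together with the whitespace that preceded it.
            out.append(gap + word)
        last = word
    return ''.join(out)

print(repr(max_repeats_keep_spacing("bye!  bye!   bye!")))   # 'bye!  bye!'
print(repr(max_repeats_keep_spacing("a a b b", limit=1)))    # 'a b'
```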

Try the following:

import re
s = "bye! bye! bye!"  # your string
s = re.sub(r'(\S+) (?:\1 ?){2,}', r'\1 \1', s)

You can see sample code here: http://codepad.org/YyS9JCLO
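One thing worth checking before relying on this pattern: it has no word-boundary guards and the optional space is consumed with the match, so text that follows the repeated run can be affected. The sketch below only exercises the pattern quoted above; the helper name collapse is hypothetical.

```python
import re

def collapse(s):
    # Pattern copied from the answer above; the helper name is hypothetical.
    return re.sub(r'(\S+) (?:\1 ?){2,}', r'\1 \1', s)

print(repr(collapse("bye! bye! bye!")))  # 'bye! bye!'
print(repr(collapse("hi hi hi there")))  # 'hi hithere' (the trailing space is swallowed)
print(repr(collapse("a a ab")))          # 'a ab' (the prefix of "ab" counts as a repeat)
```

If inputs like the last two can occur, a pattern with explicit lookarounds around the word is probably the safer choice.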

import re

def replaceThreeOrMoreWordsWithTwoWords(string):
    # Pattern to look for three or more repetitions of any word.
    pattern = re.compile(r"(?<!\S)((\S+)(?:\s+\2))(?:\s+\2)+(?!\S)", re.DOTALL)
    return pattern.sub(r"\1", string)
