Python regex for repeating punctuation and symbols

Question

I need a regex that will match repeating (more than one) punctuation and symbols: basically all repeating non-alphanumeric, non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and so on. It must be the same character that's repeated, so not a sequence like "!?@".

I tried [^\s\w]+, and while that covers all of the !!!, ???, $$$ cases, it gives me more than I want, since it also matches "!?@".

Can someone enlighten me please? Thanks.

Answer 1

I think you're looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)]
Out: ['&&&&', '((((', '****']

This gives you a list of the runs, excluding the 4 spaces in that string as well as sequences like '*(@^'.

If that's not exactly what you want, you might edit your question with an example string and precisely the output you want to see.
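One detail worth checking against your data: `\w` also matches the underscore, so `[^\w\s]` never matches `_` and a run such as `___` is skipped. Below is a minimal sketch of a variant that includes it, assuming you want underscores treated as symbols. The name `find_runs` is hypothetical and not part of the answer above.

```python
import re

# Group 1 is the whole run, group 2 the single repeated character.
# The "|_" branch adds the underscore, which [^\w\s] alone excludes.
_RUN = re.compile(r'(([^\w\s]|_)\2+)')

def find_runs(text):
    """Return every run of two or more identical symbol characters."""
    return [m.group(1) for m in _RUN.finditer(text)]

print(find_runs("wait... what?? a___b !?@ x"))  # ['...', '??', '___']
```

Using `finditer()` with `m.group(1)` avoids unpacking the (run, leadchar) tuples that `findall()` returns when a pattern has two groups.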

Answer 2

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

\1 refers to the first captured group, i.e. whatever the parentheses matched.

You need to extend the list of punctuation characters and symbols as desired.
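If hand-maintaining the list is a concern, one option is to build the character class from `string.punctuation`. This is only a sketch: it covers ASCII punctuation, not Unicode symbols, and the names `_PUNCT_CLASS`, `_REPEATED` and `repeated_punct` are made up for illustration.

```python
import re
import string

# string.punctuation holds the 32 ASCII punctuation characters;
# re.escape() backslash-escapes the ones that are special in a regex.
_PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# One punctuation character, then one or more copies of that same character.
_REPEATED = re.compile("(" + _PUNCT_CLASS + r")\1+")

def repeated_punct(text):
    """Return runs of two or more identical ASCII punctuation characters."""
    return [m.group(0) for m in _REPEATED.finditer(text)]

print(repeated_punct("Really??? No!! a-b <<>> 100%"))  # ['???', '!!', '<<', '>>']
```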

Answer 3

EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.

The answer I wrote before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here:

How to check that a string is a palindrome using regular expressions?

For a Python solution I would recommend combining a regular expression with a little bit of Python code. The regular expression throws out everything that isn't a run of some sort of punctuation, and then the Python code throws out false matches (matches that are runs of punctuation but not all the same character).

import re
import string

# Character class to match punctuation.  Some punctuation characters
# (such as '-', ']', '^' and '\') are special inside a character class;
# re.escape() puts a backslash in front of them so they are literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


# alternate version of find_all_punct_runs() using re.finditer()
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

I wrote all_same() the way I did so that it works just as well on an iterator as on a string. The Python built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I added an argument for the desired basis case and made it default to True to match the behavior of all().

This does as much of the work as possible using the internals of Python (the regular expression engine or all()), so it should be pretty fast. For large input texts you might want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example. The example also returns a generator expression rather than a list. You can always force it to make a list:

lst = list(find_all_punct_runs(text))
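One limitation of the filter above: a mixed run such as `!!!?` is matched as a single punctuation run and then discarded whole, even though it contains `!!!`. A hedged alternative, which is not part of the original answer, is to split each punctuation run into same-character groups with `itertools.groupby`. The name `same_char_runs` is hypothetical.

```python
import re
import string
from itertools import groupby

# Any run of two or more ASCII punctuation characters, mixed or not.
_PUNCT_RUN = re.compile("[" + re.escape(string.punctuation) + "]{2,}")

def same_char_runs(text):
    """Yield runs of two or more identical punctuation characters."""
    for m in _PUNCT_RUN.finditer(text):
        # groupby clusters consecutive equal characters inside the run.
        for _char, group in groupby(m.group(0)):
            run = "".join(group)
            if len(run) > 1:
                yield run

print(list(same_char_runs("ok!!!? fine... *(@^ ##")))  # ['!!!', '...', '##']
```

Like the finditer() version above, this is a generator, so wrap it in list() when a list is needed.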

Answer 4

This is how I would do it:

>>> import re
>>> st = 'non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg = r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg, st)])

or

>>> print([g for g, l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
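Two things to keep in mind with this pattern. '!!!' is missing from the output because '!' is not in the character class. And `\2{2,}` asks for at least two further copies of the character, so only runs of three or more are reported, whereas the question asks for anything repeated more than once. A small comparison of the two quantifiers, using a made-up test string and a single capture group (hence `\1` rather than `\2`):

```python
import re

text = "a.. b... c?? d????"

# \1+ : the character plus at least one copy, so runs of length >= 2.
two_or_more = [m.group(0) for m in re.finditer(r'([.?#@+])\1+', text)]

# \1{2,} : the character plus at least two copies, so runs of length >= 3.
three_or_more = [m.group(0) for m in re.finditer(r'([.?#@+])\1{2,}', text)]

print(two_or_more)    # ['..', '...', '??', '????']
print(three_or_more)  # ['...', '????']
```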

4 answers

Answer 1
2 2013-02-01 03:46:38

Answer 2
1 2013-02-01 02:53:55

Answer 3
1 2013-02-01 03:23:18

Answer 4
0 2013-02-01 03:33:25
