
Python regular expression for repeating punctuation and symbols

I need a regex that will match repeating (more than one) punctuation and symbols: basically all repeating non-alphanumeric and non-whitespace characters such as ..., ???, !!!, ###, @@@, +++, etc. It must be the same character that's repeated, so not a sequence like "!?@".

I tried [^\s\w]+, and while that covers all of the !!!, ???, $$$ cases, it gives me more than I want, since it also matches "!?@".

Can someone enlighten me please? Thanks.

I think you're looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)]
Out: ['&&&&', '((((', '****']

This gives you a list of the runs, excluding the 4 spaces in that string as well as sequences like '*(@^'
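As a minimal sketch of how that pattern could be wrapped for reuse (the function name `find_repeated_symbols` is hypothetical, not part of the answer): with `re.finditer` and `group(0)` the outer capturing group is no longer needed, so the backreference becomes `\1`. One thing to keep in mind is that `\w` includes the underscore, so a run like `___` is not matched by `[^\w\s]`.

```python
import re

# Group 1 captures the lead character; \1+ requires one or more
# repeats of that same character, i.e. a run of two or more.
_REPEAT = re.compile(r'([^\w\s])\1+')

def find_repeated_symbols(text):
    """Return every run of two or more identical non-word, non-space characters."""
    return [m.group(0) for m in _REPEAT.finditer(text)]

print(find_repeated_symbols("4spaces    then(*(@^#$&&&&(2((((99999****"))
# ['&&&&', '((((', '****']
print(find_repeated_symbols("wait... what?! ok!!"))
# ['...', '!!']
```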

If that's not exactly what you want, you might edit your question with an example string and precisely what output you wanted to see.

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

\1 refers to the first matched group, which is the contents of the parentheses.

You need to extend the list of punctuation characters and symbols as desired.
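If maintaining the list by hand is a nuisance, one possible shortcut (an assumption on my part, not something this answer proposes) is to build the character class from `string.punctuation`, which covers the ASCII punctuation characters, and let `re.escape` handle the characters that are special inside a class. The names below are illustrative only.

```python
import re
import string

# string.punctuation is the ASCII punctuation set; re.escape backslash-escapes
# the characters that would otherwise be special inside [...] (such as ], \, ^, -).
_PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# One punctuation character, followed by one or more copies of itself.
_PUNCT_RUN = re.compile("(" + _PUNCT_CLASS + r")\1+")

def find_punct_runs(text):
    return [m.group(0) for m in _PUNCT_RUN.finditer(text)]

print(find_punct_runs("..., ???, !!!, ###, @@@, +++ but not !?@"))
# ['...', '???', '!!!', '###', '@@@', '+++']
```

Unlike `[^\w\s]`, this class includes the underscore but excludes non-ASCII symbols.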

EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.

Answer written before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here:

How to check that a string is a palindrome using regular expressions?

For a Python solution I would recommend combining a regular expression with a little bit of Python code. The regular expression throws out everything that isn't a run of some sort of punctuation, and then the Python code checks to throw out false matches (matches that are runs of punctuation but not all the same character).

import re
import string

# Character class to match punctuation.  Some punctuation characters
# (such as '-', ']' and '\') are special inside a character class, so
# re.escape() puts a backslash in front of them to make them literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


# alternate version of find_all_punct_runs() using re.finditer()
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

I wrote all_same() the way I did so that it will work just as well on an iterator as on a string. The Python built-in all() returns True for an empty sequence, which is not what we want for this particular use of all_same(), so I added a basis_case argument for the result to return on empty input and made it default to True to match the behavior of all().
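A quick way to see what the basis case does is to call the function on a few inputs. The function is repeated here only so the snippet runs on its own; the sample inputs are made up for illustration.

```python
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        # Empty input: return whatever the caller asked for.
        return basis_case
    return all(x == first for x in itr)

print(all_same("!!!"))             # True
print(all_same("!?@"))             # False
print(all_same(""))                # True, same as all([])
print(all_same("", False))         # False, the caller's basis case
print(all_same(iter([7, 7, 7])))   # True, works on an iterator too
```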

This does as much of the work as possible using the internals of Python (the regular expression engine or all()), so it should be pretty fast. For large input texts you might want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example. The example also returns a generator expression rather than a list. You can always force it to make a list:

lst = list(find_all_punct_runs(text))
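One behavioral difference from the single-regex answer is worth checking before choosing this approach: a same-character run that sits inside a longer stretch of mixed punctuation is matched as part of the mixed run and then filtered out. The sketch below compares the two; the function names are hypothetical, and `len(set(s)) == 1` is used as a shorthand for `all_same(s, False)` on strings.

```python
import re
import string

punct = "[" + re.escape(string.punctuation) + "]"
pat_punct_run = re.compile(punct + punct + "+")    # any run of 2+ punctuation chars
pat_same_char = re.compile("(" + punct + r")\1+")  # 2+ copies of the same char

def runs_by_filtering(text):
    # Match mixed runs first, then keep only those made of a single character.
    return [s for s in pat_punct_run.findall(text) if len(set(s)) == 1]

def runs_by_backreference(text):
    return [m.group(0) for m in pat_same_char.finditer(text)]

text = "then(*(@^#$&&&&(2((((99999****"
print(runs_by_filtering(text))      # ['((((', '****']
print(runs_by_backreference(text))  # ['&&&&', '((((', '****']
```

Here `&&&&` is lost by the filtering version because it is swallowed by the mixed run `(*(@^#$&&&&(`.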

This is how I would do it:

>>> import re
>>> st='non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and' 
>>> reg=r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg,st)])

or

>>> print([g for g,l in re.findall(reg, st)])

Either one prints the following ('!!!' is absent because '!' is not in the character class):

['...', '???', '###', '@@@', '+++']
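Note that `\2{2,}` asks for the captured character plus at least two more copies, so only runs of three or more match. Since the question says "more than one", `+` may be closer to what was asked. A small sketch of the difference follows; the sample string is made up, and `!` has been added to the class as an assumption.

```python
import re

st = 'such as ..., ???, !!, ## and ++++'

three_or_more = re.compile(r'([.?!#@+])\1{2,}')  # lead char + 2 or more repeats
two_or_more = re.compile(r'([.?!#@+])\1+')       # lead char + 1 or more repeats

print([m.group(0) for m in three_or_more.finditer(st)])
# ['...', '???', '++++']
print([m.group(0) for m in two_or_more.finditer(st)])
# ['...', '???', '!!', '##', '++++']
```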
