
Number of words with non-English characters, special characters such as punctuation, or digits at beginning or middle of word

I need to count words that have non-English characters, special characters such as punctuation, or digits at the beginning or in the middle of the word. I am trying to do it with re, and so far I have:

begin_searcher = re.compile(r'[0-9]+[\w\-]')
middle_searcher = re.compile(r'[\w\-]+[0-9]+[\w\-]')
both_searcher = re.compile(r'[0-9]+[\w\-]+[0-9]+[\w\-]')

But it gives completely wrong results. Could anyone who knows re better than I do please help?

I need to count words like these:

'asfas1254asffas'
'125safasffa'
'asd!asfg'
'asff#dasf'
'sex!!!!'
'safщовфау'

etc

Since you mentioned "non-English" characters, I recommend using the regex module instead of the stock re, because of the latter's weak Unicode support. Unless I misunderstood the question, you're looking for something like:

regex.match(ur'^\p{L}*[\p{P}\p{Nd}]*\p{L}+$', s)

where s is expected to be a unicode object. This matches u"123щовßß" and u"щов456ßß" and rejects u"щовßß!!!".
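If installing the third-party regex module is not an option, the same shape (letters, then punctuation or decimal digits, then at least one letter) can be approximated with the standard library. The following is only a sketch for Python 3. The helper names to_classes and has_answer_shape are made up for illustration, and the Unicode categories come from unicodedata rather than from \p{...} classes.

```python
import re
import unicodedata

def to_classes(s):
    """Map each character to L (letter), P (punctuation),
    N (decimal digit) or X (anything else)."""
    out = []
    for ch in s:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):
            out.append("L")
        elif cat.startswith("P"):
            out.append("P")
        elif cat == "Nd":
            out.append("N")
        else:
            out.append("X")
    return "".join(out)

# Same structure as ^\p{L}*[\p{P}\p{Nd}]*\p{L}+$, applied to the class string.
SHAPE = re.compile(r"L*[PN]*L+")

def has_answer_shape(s):
    """True if s is letters, then punctuation/digits, then at least one letter."""
    return SHAPE.fullmatch(to_classes(s)) is not None

for word in ["123щовßß", "щов456ßß", "щовßß!!!"]:
    print(word, has_answer_shape(word))
```

Note that a pattern of this shape also accepts words made only of letters, because the middle group may be empty. To count only the irregular words, the middle group would have to require at least one character.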

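In case it helps:

from string import ascii_letters

def find_alphabetic_words(self, text):
    # True if every character but the last is an ASCII letter and the
    # last one is an ASCII letter or closing punctuation ("?!,.").
    letters = ascii_letters
    letters_nd_term = letters + "?!,."
    return not any([set(text[:-1]).difference(letters), text[-1] not in letters_nd_term])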
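To get an actual count out of a check like the one above, it can be inverted and applied to every whitespace-separated word. This is a minimal sketch for Python 3. The names is_plain_word and count_irregular_words are hypothetical, and it assumes that a single trailing "?", "!", "," or "." is acceptable while anything else that is not an ASCII letter makes the word count.

```python
from string import ascii_letters

TERMINATORS = "?!,."  # closing punctuation allowed as the last character

def is_plain_word(word):
    """True if all characters but the last are ASCII letters and the last
    is an ASCII letter or one closing punctuation mark."""
    if not word:
        return False
    body, last = word[:-1], word[-1]
    return (all(ch in ascii_letters for ch in body)
            and last in ascii_letters + TERMINATORS)

def count_irregular_words(text):
    """Count whitespace-separated words that are not plain."""
    return sum(1 for word in text.split() if not is_plain_word(word))

samples = ("asfas1254asffas 125safasffa asd!asfg asff#dasf "
           "sex!!!! safщовфау hello world!")
print(count_irregular_words(samples))  # prints 6
```

All six example words from the question are counted, while "hello" and "world!" are not.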
