I need to count words that contain non-English characters, special characters such as punctuation, or digits at the beginning or in the middle of the word. I am trying to do it with re, and so far I have:
begin_searcher = re.compile(r'[0-9]+[\w\-]')
middle_searcher = re.compile(r'[\w\-]+[0-9]+[\w\-]')
both_searcher = re.compile(r'[0-9]+[\w\-]+[0-9]+[\w\-]')
But it works completely wrong. Could anyone who knows re better than me please help?
I need to count words like these:
'asfas1254asffas'
'125safasffa'
'asd!asfg'
'asff#dasf'
'sex!!!!'
'safщовфау'
etc
Since you mentioned "non-English" characters, I recommend using the regex module instead of the stock re, because of the weak Unicode support in the latter. Unless I misunderstood the question, you're looking for something like:
regex.match(ur'^\p{L}*[\p{P}\p{Nd}]*\p{L}+$', s)
where s is expected to be a unicode object. This matches u"123щовßß" and u"щов456ßß" and rejects u"щовßß!!!".
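For comparison, here is a minimal sketch that counts such words in Python 3 using only the standard library. It assumes that a "plain" word consists solely of ASCII letters and that words are separated by whitespace; the names PLAIN_WORD and count_unusual_words are made up for illustration. It is stricter than the pattern above, since any digit, punctuation mark, or non-English letter anywhere in the word makes it count.

```python
import re

# Assumption: a "plain" word is made of ASCII letters only.
PLAIN_WORD = re.compile(r'[A-Za-z]+')

def count_unusual_words(text):
    """Count whitespace-separated tokens that are not purely ASCII letters."""
    return sum(1 for token in text.split() if not PLAIN_WORD.fullmatch(token))

samples = ['asfas1254asffas', '125safasffa', 'asd!asfg',
           'asff#dasf', 'sex!!!!', 'safщовфау']
print(count_unusual_words(' '.join(samples)))  # 6
print(count_unusual_words('hello world'))      # 0
```

If trailing punctuation such as "word." should not count, the tokens could be stripped of it before the check; whether that is wanted depends on how the sample 'sex!!!!' is meant to be treated.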
If it could help:
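from string import ascii_letters

def find_alphabetic_words(self, text):
    # True if every character except the last is an ASCII letter
    # and the last one is a letter or terminal punctuation.
    letters = ascii_letters
    letters_nd_term = letters + "?!,."
    return not any([set(text[:-1]).difference(letters), text[-1] not in letters_nd_term])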
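The set-difference check above can be tried out as a standalone function. The sketch below is a hypothetical variant (the name is_alphabetic_word and the terminators parameter are not from the original answer) that also guards against empty strings, which would otherwise raise an IndexError on text[-1]. Words for which it returns False are the ones the question wants counted.

```python
from string import ascii_letters

def is_alphabetic_word(word, terminators="?!,."):
    """Return True for ASCII-letter words with at most one trailing terminator."""
    if not word:
        # Assumption: an empty string is not treated as a word.
        return False
    body_ok = not set(word[:-1]).difference(ascii_letters)
    last_ok = word[-1] in ascii_letters + terminators
    return body_ok and last_ok

words = ['hello', 'hello!', 'asd!asfg', '125safasffa', 'sex!!!!', 'safщовфау']
flagged = [w for w in words if not is_alphabetic_word(w)]
print(len(flagged))  # 4
```

Note that only a single trailing punctuation mark is tolerated, so 'hello!' passes while 'sex!!!!' is flagged.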