Number of words with non-English characters, special characters such as punctuation, or digits at the beginning or middle of a word

Question

I need to count words that contain non-English characters, special characters such as punctuation, or digits at the beginning or in the middle of the word. I am trying to do it with re, and so far I have:

begin_searcher = re.compile(r'[0-9]+[\w\-]')
middle_searcher = re.compile(r'[\w\-]+[0-9]+[\w\-]')
both_searcher = re.compile(r'[0-9]+[\w\-]+[0-9]+[\w\-]')

But it does not work correctly at all. Could anyone who knows re better than I do please help?

I need to count words like these:

'asfas1254asffas'
'125safasffa'
'asd!asfg'
'asff#dasf'
'sex!!!!'
'safщовфау'

etc

Answer 1

Since you mentioned "non-English" characters, I recommend using the third-party regex module instead of the stock re, because the latter has weak Unicode support (for example, no \p{...} property classes). Unless I misunderstood the question, you're looking for something like:

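import regex  # third-party module, not the standard-library re
regex.match(ur'^\p{L}*[\p{P}\p{Nd}]*\p{L}+$', s)  # Python 2 literal; in Python 3 write r'...'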

where s is expected to be a unicode object. This matches u"123щовßß" and u"щов456ßß" and rejects u"щовßß!!!".
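For readers without the regex module, the same idea can be sketched with the Python 3 standard library alone. This is an approximation under stated assumptions, not the answer's exact pattern: each character is reduced to a class via unicodedata.category ("L" for letters, "X" for punctuation or decimal digits), and the resulting shape string is matched with re. The names char_class, has_inner_digit_or_punct, and count_matching are hypothetical, and the middle group uses + rather than * so that plain all-letter words are not counted.

```python
import re
import unicodedata

# Shape of a word with digits/punctuation at the beginning or in the middle:
# optional letters, at least one digit/punctuation char, then letters to the end.
SHAPE = re.compile(r"L*X+L+")

def char_class(ch):
    """Map a character to 'L' (letter), 'X' (punctuation or decimal digit) or '?'."""
    cat = unicodedata.category(ch)
    if cat.startswith("L"):
        return "L"
    if cat.startswith("P") or cat == "Nd":
        return "X"
    return "?"  # anything else (symbols, marks, ...) never matches SHAPE

def has_inner_digit_or_punct(word):
    """True if the word's character classes match L*X+L+ in full."""
    shape = "".join(char_class(ch) for ch in word)
    return SHAPE.fullmatch(shape) is not None

def count_matching(text):
    """Count whitespace-separated words that match the shape (hypothetical helper)."""
    return sum(1 for word in text.split() if has_inner_digit_or_punct(word))
```

Note that this sketch, like the pattern above, treats every Unicode letter alike, so a word such as 'safщовфау' is not flagged as "non-English", and a word with only trailing punctuation such as 'sex!!!!' is rejected.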

Answer 2

In case it helps:

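from string import ascii_letters

def find_alphabetic_words(self, text):  # method excerpt; drop `self` for a plain function
    # True if every character but the last is an ASCII letter and the last
    # is an ASCII letter or one of "?!,." (an ordinary word, not to be counted)
    letters = ascii_letters
    letters_nd_term = letters + "?!,."
    return not any([set(text[:-1]).difference(letters), text[-1] not in letters_nd_term])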
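The check above only classifies a single word. A minimal sketch of how it might be turned into the count the question asks for, assuming words are separated by whitespace; is_plain_word, count_unusual_words, and TERMINATORS are hypothetical names, and treating an empty string as "nothing to count" is an assumption added here to avoid indexing an empty string.

```python
from string import ascii_letters

TERMINATORS = "?!,."  # trailing punctuation tolerated on an ordinary word

def is_plain_word(word):
    """True if all characters but the last are ASCII letters and the last
    is an ASCII letter or one of TERMINATORS."""
    if not word:
        return True  # assumption: an empty string is nothing to count
    body_ok = all(ch in ascii_letters for ch in word[:-1])
    return body_ok and word[-1] in ascii_letters + TERMINATORS

def count_unusual_words(text):
    """Count whitespace-separated words that are not plain."""
    return sum(1 for word in text.split() if not is_plain_word(word))

sample = "asfas1254asffas 125safasffa asd!asfg asff#dasf sex!!!! safщовфау hello world."
print(count_unusual_words(sample))  # prints 6
```

Under this rule all six examples from the question are counted, including 'sex!!!!', because only a single trailing punctuation mark is tolerated.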

2 answers

solution1
0 2012-11-10 14:25:11

solution2
0 ACCEPTED 2012-12-04 19:53:51
