Replacing duplicated letters in a Unicode string in Python

Question

I need to collapse mistyped doubled letters in a string, for example "bbig" should become "big". My code works only for Latin letters, not for Cyrillic. I am using Python 2.6.6 on CentOS Linux.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
def reg(item):
  item = re.sub(r'([A-ZА-ЯЁЄЇІ])\1', r'\1', item, re.U)
  #this also works only with Latin letters
  #item = re.sub(r'(.)\1', r'\1', item, re.U)
  return item

print reg('ББООЛЛЬЬШШООЙЙ')
print reg('BBIIGG')

The code above returns:

ББООЛЛЬЬШШООЙЙ
BIG

What did I do wrong? Thanks for your help.

Answer 1

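You are using byte strings, so the regex matches and replaces individual bytes rather than letters. In UTF-8 each Cyrillic letter is encoded as two bytes, so a doubled letter never shows up as two identical adjacent bytes and the pattern cannot match.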
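To make the byte-versus-letter difference concrete, here is a minimal sketch. The variable names are made up for illustration, and the strings are written with escapes so the snippet does not depend on the file encoding.

```python
# -*- coding: utf-8 -*-
import re

text = u'\u0411\u0411'            # two Cyrillic letters: "ББ"
raw = text.encode('utf-8')        # the same text as UTF-8 bytes: D0 91 D0 91

# On bytes, "." is a single byte. No two adjacent bytes are equal here,
# so the backreference never matches and nothing is replaced.
byte_result = re.compile(br'(.)\1').sub(br'\1', raw)

# On unicode text, "." is a whole character, so the doubled letter is found.
text_result = re.compile(u'(.)\\1', re.UNICODE).sub(u'\\1', text)

# len(raw) is 4, len(text) is 2
# byte_result is unchanged, text_result is the single letter u'\u0411'
```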

Use unicode strings instead:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
def reg(item):
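  # re.sub's fourth positional argument is count, not flags (Python 2.6 has
  # no flags argument there), so compile the pattern to apply re.U
  item = re.compile(ur'([A-ZА-ЯЁЄЇІ])\1', re.U).sub(r'\1', item)
  # the generic form also works once the pattern and input are unicode
  #item = re.compile(ur'(.)\1', re.U).sub(r'\1', item)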
  return item

print reg(u'ББООЛЛЬЬШШООЙЙ')
print reg(u'BBIIGG')

Note that this works fine for precomposed characters but will fall flat with characters composed using combining marks.

It will also be disastrous if the user tries to type this very sentence (hint: check its second word).
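For the combining-mark caveat, one possible mitigation is to normalize the input to NFC before matching, so that a base letter plus combining mark becomes a single precomposed character where one exists. This is only a sketch: `collapse_doubles` and `DOUBLED` are hypothetical names, not part of the original answer, and sequences that have no precomposed form are still not handled.

```python
# -*- coding: utf-8 -*-
import re
import unicodedata

# Hypothetical helper names, chosen for illustration only.
DOUBLED = re.compile(u'(.)\\1', re.UNICODE)

def collapse_doubles(text):
    # Compose base letter + combining mark into one code point where possible.
    composed = unicodedata.normalize('NFC', text)
    # Then collapse any character that is immediately repeated.
    return DOUBLED.sub(u'\\1', composed)

# "йй" typed as и (U+0438) followed by a combining breve (U+0306), twice.
decomposed = u'\u0438\u0306\u0438\u0306'

# Without normalization no two adjacent code points are equal, so nothing changes.
unnormalized_result = DOUBLED.sub(u'\\1', decomposed)

# With normalization the text becomes U+0439 U+0439 and collapses to one letter.
normalized_result = collapse_doubles(decomposed)
```

This does nothing about the other caveat: legitimately doubled letters are collapsed as well, so a whitelist or dictionary check would still be needed.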

solution1
2 ACCEPTED 2013-05-24 13:26:15