How to find all capital and lower case occurrences of a unicode character using regex and re.sub in Python?

Question

This is my code in a Django view (intentionally simplified, Python 2.7):

# -*- coding: utf-8 -*-
from django.shortcuts import render
import re

def index(request):
    found_verses = [] 
    pattern = re.compile('ю')

    with open('d.txt', 'r') as doc:
        for line in doc:

            found = pattern.search(line)

            if found:
                modified_line = pattern.sub('!'+'\g<0>'+'!',line)
                found_verses.append(modified_line)

    context = {'found_verses': found_verses}
    return render(request, 'myapp/index.html', context)

d.txt (also utf-8) contains this one line (intentionally simplified):

1. Я сказал Юлию одному.

The above, when rendered, gives me the expected result:

1. Я сказал Юли!ю! одному.

When I change the pattern to a capital letter, pattern = re.compile('Ю'), it also gives me the expected result:

1. Я сказал !Ю!лию одному.

But when I change it to a character class, such as pattern = re.compile('[юЮ]'), pattern = re.compile('[Юю]'), pattern = re.compile('[ю]') or pattern = re.compile('[Ю]'), it gives me nothing. What I am trying to get is this:

1. Я сказал !Ю!ли!ю! одному.

Please help me get this result. I've been struggling for more than a day and have tried different configurations, such as pattern = re.compile('[юЮ]', re.UNICODE) and pattern = re.compile('ю', re.UNICODE|re.I), and countless others, all in vain.

Answer 1

Use unicode strings, both for the text read from the file and for the pattern.

import io

with io.open('d.txt', 'r', encoding='utf-8') as doc:
    ...

...

pattern = re.compile(u'[юЮ]', re.UNICODE)
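
To make the idea concrete, here is a minimal end-to-end sketch of this approach. The helper name mark_matches and the temporary file standing in for d.txt are assumptions made for the demonstration, not part of the original view. It is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile

# Unicode pattern: the class holds two characters rather than four bytes.
pattern = re.compile(u'[юЮ]', re.UNICODE)


def mark_matches(path):
    """Return the lines of `path` that contain a match, each match wrapped in '!'."""
    found_verses = []
    # io.open decodes while reading, so every line is a unicode string.
    with io.open(path, 'r', encoding='utf-8') as doc:
        for line in doc:
            if pattern.search(line):
                found_verses.append(pattern.sub(u'!\\g<0>!', line))
    return found_verses


# Stand-in for d.txt (hypothetical temporary file used only for this demo).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(u'1. Я сказал Юлию одному.\n')

found_verses = mark_matches(path)
os.remove(path)
```

With the sample line from the question, found_verses should contain the single line 1. Я сказал !Ю!ли!ю! одному. (with its trailing newline).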

Answer 2

Just a guess, but try this:

with open('d.txt', 'rb') as doc:  # the b flag is probably not needed for utf8
    for line in doc:
        line = line.decode("utf8")
        ...
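
Decoding each line only helps if the pattern is a unicode string as well. A rough sketch of that combination follows, assuming an in-memory io.BytesIO object in place of the real d.txt. It also uses the case-insensitive variant from the question, which should behave as intended once both sides are unicode.

```python
# -*- coding: utf-8 -*-
import io
import re

# One lower-case letter plus IGNORECASE; with re.UNICODE and a unicode
# pattern, case folding also covers Cyrillic letters.
pattern = re.compile(u'ю', re.UNICODE | re.IGNORECASE)

# io.BytesIO stands in for open('d.txt', 'rb') in this demonstration.
doc = io.BytesIO(u'1. Я сказал Юлию одному.\n'.encode('utf-8'))

found_verses = []
for raw_line in doc:
    line = raw_line.decode('utf-8')  # bytes -> unicode before matching
    if pattern.search(line):
        found_verses.append(pattern.sub(u'!\\g<0>!', line))
```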

Answer 3

The problem is probably that you are using regular byte strings, not unicode strings. The re module needs to know how to interpret the bytes in your regular expression. Try:

re.compile(u'ю')

(Note that this is how @Ignacio does it in his answer).
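
A likely explanation for why the single-letter patterns worked while the character classes did not: in UTF-8, ю is two bytes, so a byte-string class such as '[ю]' means "either of these two bytes", and re.sub then wraps each byte separately. The sketch below reproduces that at the byte level; the variable names are illustrative only, and how Django reacts to the resulting invalid UTF-8 is not shown here.

```python
# -*- coding: utf-8 -*-
import re

text = u'Юлию'
raw = text.encode('utf-8')          # what a Python 2 str read from the file holds

lower_bytes = u'ю'.encode('utf-8')  # two bytes: 0xD1 0x8E

# The byte pattern '[ю]' is really the class [\xd1\x8e].
byte_class = re.compile(b'[' + lower_bytes + b']')
broken = byte_class.sub(b'!\\g<0>!', raw)

# Each half of the letter got its own pair of '!', so the result
# is no longer valid UTF-8.
try:
    broken.decode('utf-8')
    still_valid_utf8 = True
except UnicodeDecodeError:
    still_valid_utf8 = False

# A unicode pattern sees one character and wraps it as a whole.
fixed = re.compile(u'[ю]', re.UNICODE).sub(u'!\\g<0>!', text)
```

The plain pattern 'ю' avoids this by accident: outside a character class the two bytes are matched as a sequence, so the letter is wrapped as a whole.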

3 answers

solution1
3 ACCEPTED 2014-04-06 20:15:51

solution2
1 2014-04-06 20:16:07

solution3
0 2014-04-06 20:28:47
