
How to find all capital and lower case occurrences of unicode character using regex and re.sub in Python?

This is my code in a Django view (intentionally simplified, Python 2.7):

# -*- coding: utf-8 -*-
from django.shortcuts import render
import re

def index(request):
    found_verses = [] 
    pattern = re.compile('ю')

    with open('d.txt', 'r') as doc:
        for line in doc:

            found = pattern.search(line)

            if found:
                modified_line = pattern.sub(r'!\g<0>!', line)
                found_verses.append(modified_line)

    context = {'found_verses': found_verses}
    return render(request, 'myapp/index.html', context)

d.txt (also UTF-8) contains this one line (intentionally simplified):

1. Я сказал Юлию одному.

The above, when rendered, gives me the expected result:

1. Я сказал Юли!ю! одному.

When I change to a capital letter, pattern = re.compile('Ю'), it also gives me the expected result:

1. Я сказал !Ю!лию одному.

But when I change to a character class, pattern = re.compile('[юЮ]') or pattern = re.compile('[Юю]') or pattern = re.compile('[ю]') or pattern = re.compile('[Ю]'), it gives me nothing. What I am trying to get is this:

1. Я сказал !Ю!ли!ю! одному.

Please help me get this result. I've been struggling for more than a day and have tried different configurations like pattern = re.compile('[юЮ]', re.UNICODE) and pattern = re.compile('ю', re.UNICODE|re.I), and this, and countless others, but all in vain.

Use unicode strings.

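import io
import re

# Decode the file as UTF-8 while reading, so each line is a unicode string
with io.open('d.txt', 'r', encoding='utf-8') as doc:
    ...

...

# Compile the pattern from a unicode literal as well
pattern = re.compile(u'[юЮ]', re.UNICODE)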
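
To make the suggestion above concrete, here is a minimal sketch that puts both pieces together: reading with io.open and matching with a unicode pattern. The helper name mark_matches and the temporary file standing in for d.txt are assumptions made for illustration, not part of the original view. It also tries the re.IGNORECASE variant from the question, which should behave the same way once both the pattern and the text are unicode.

```python
# -*- coding: utf-8 -*-
import io
import os
import re
import tempfile


def mark_matches(path, pattern):
    """Return the lines of a UTF-8 file that match, with each match wrapped in '!'."""
    marked = []
    with io.open(path, 'r', encoding='utf-8') as doc:
        for line in doc:  # each line is already decoded text
            if pattern.search(line):
                marked.append(pattern.sub(u'!\\g<0>!', line))
    return marked


# Throwaway stand-in for d.txt (assumption: same single sample line).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(path, 'w', encoding='utf-8') as out:
    out.write(u'1. Я сказал Юлию одному.\n')

char_class = re.compile(u'[юЮ]', re.UNICODE)
ignore_case = re.compile(u'ю', re.UNICODE | re.IGNORECASE)

by_class = mark_matches(path, char_class)
by_flag = mark_matches(path, ignore_case)
os.remove(path)

print(by_class[0])
```

Either pattern wraps both the capital and the lower case letter; which one to prefer is mostly a matter of taste, though the flag scales better if more letters are added later.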

Just a guess, but try this:

with open('d.txt', 'rb') as doc:  # you probably don't need the b flag for UTF-8, but meh
    for line in doc:
        line = line.decode("utf8")
        ...
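
As a rough sketch of the same idea end to end, the loop below decodes each line before matching and then applies a unicode pattern. An in-memory io.BytesIO stands in for d.txt so the snippet runs on its own; the name fake_file and the second sample line are assumptions added for illustration.

```python
# -*- coding: utf-8 -*-
import io
import re

pattern = re.compile(u'[юЮ]', re.UNICODE)

# Stand-in for open('d.txt', 'rb'): UTF-8 encoded bytes held in memory.
fake_file = io.BytesIO(u'1. Я сказал Юлию одному.\nНичего.\n'.encode('utf-8'))

found_verses = []
for raw_line in fake_file:
    line = raw_line.decode('utf-8')  # bytes -> unicode text
    if pattern.search(line):
        # The replacement is a unicode string too, so nothing is mixed with bytes.
        found_verses.append(pattern.sub(u'!\\g<0>!', line))

print(found_verses[0])
```

Note that decoding alone is only half of the fix in this sketch: the pattern is also built from a unicode literal, so that pattern and text are the same kind of string.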

The problem is probably that you are using regular strings, not unicode strings. The re library needs to know how to treat the bytes in your RE. Try:

re.compile(u'ю')

(Note that this is how @Ignacio does it in his answer.)
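
The following sketch illustrates one plausible reading of why the byte-string character class misbehaves. It assumes the pattern and the text are both UTF-8 encoded bytes, as they would be for plain string literals in a Python 2 source file with a utf-8 coding line. Each Cyrillic letter is two bytes, so the class matches single bytes rather than whole letters, and the substitution splits the byte pairs apart. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
import re

text = u'Юлию'
raw = text.encode('utf-8')  # two bytes per Cyrillic letter

# A character class built from bytes matches individual bytes, not letters.
byte_pattern = re.compile(u'[юЮ]'.encode('utf-8'))
byte_matches = byte_pattern.findall(raw)

# Wrapping single bytes in '!' separates the two halves of each letter,
# so the result is no longer valid UTF-8.
broken = byte_pattern.sub(br'!\g<0>!', raw)
try:
    broken.decode('utf-8')
    decodes_cleanly = True
except UnicodeDecodeError:
    decodes_cleanly = False

# The same class compiled from a unicode string matches whole letters.
text_pattern = re.compile(u'[юЮ]', re.UNICODE)
text_matches = text_pattern.findall(text)

print(len(byte_matches), decodes_cleanly, len(text_matches))
```

This would also explain why the single-letter patterns in the question appeared to work: without brackets, the two bytes of the letter are matched in sequence as a unit, so the substitution leaves the encoding intact.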
