Django: iregex is case-sensitive

Question

Hitting the db (MySQL) with these two queries, one right after the other, I get different results:

test1 = Agreement.objects.filter(pk=152, company__iregex='СитиСтро(и|й)')
test2 = Agreement.objects.filter(pk=152, company__iregex='ситистро(и|й)')

test1 <QuerySet [<Agreement: Agreement object>]>
test2 <QuerySet []>

The actual value of the field is "СитиСтрой".

Now I'm pretty sure that it is the Cyrillic that is messing things up, because with records in the Latin alphabet it works fine, but I have no idea how to get around that (bug?). Any advice here?

PS: I did double-check; there is no confusion here between the similar-looking letters C of English and Russian, which have different character codes.
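One way to make that kind of check reproducible is to print the Unicode name of each letter. The following is a minimal sketch; the helper names `describe_letters` and `scripts_used` are made up for illustration, and the sample strings are written with escapes so the Latin and Cyrillic letters cannot be confused on screen.

```python
import unicodedata


def describe_letters(text):
    """Return (character, Unicode name) pairs for every letter in text."""
    return [(ch, unicodedata.name(ch, 'UNKNOWN')) for ch in text if ch.isalpha()]


def scripts_used(text):
    """Return the first word of each letter's Unicode name, e.g. CYRILLIC or LATIN."""
    return {name.split()[0] for _, name in describe_letters(text)}


# Hypothetical sample strings: they look the same in most fonts.
cyrillic_word = '\u0421\u0438\u0442\u0438'  # starts with CYRILLIC CAPITAL LETTER ES
mixed_word = 'C\u0438\u0442\u0438'          # starts with LATIN CAPITAL LETTER C

print(scripts_used(cyrillic_word))
print(scripts_used(mixed_word))
```

If the second set contains more than one script, a look-alike letter has slipped into the string.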

Update: I checked the SQL that Django sends to MySQL.

('SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company`, '
 'FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`company` REGEXP '
 'СитиСтро(и|й) AND `dbbs_app_agreement`.`id` = 152)')

Seems fine. I tried querying the table directly from phpMyAdmin with

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'С')

which worked, but

SELECT `dbbs_app_agreement`.`id`, `dbbs_app_agreement`.`company` FROM `dbbs_app_agreement` WHERE (`dbbs_app_agreement`.`id` = 152 AND `dbbs_app_agreement`.`company` REGEXP 'с')

at the same time does not.

As @AndreyShipilov suggested below, I made a new table in the db from scratch with utf8_unicode_ci collation, inserted the value in question (ООО "СитиСтрой") and tried these two queries from phpMyAdmin:

SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'с')
SELECT `company`.`id`, `company`.`company` FROM `company` WHERE (`company`.`id` = 0 AND `company`.`company` REGEXP 'С')

The second one works, the first one does not. Really weird.

Update 2: My initial code that built the query looked like this:

query_ka_name = reduce(operator.and_,
(Q(company__iregex=r'(([^\w]|^){i}([^\w]|$))'.format(i=re.sub(r'и|й', '(и|й)', item, flags=re.IGNORECASE)))
 for item in ka_name_listed))

The purpose was to check whether a db record corresponds to the array of keywords recognized from a scan as a company name. Since the scanner is really bad at differentiating й from и, and the db records are beyond my control, I added that little thing to treat these letters as one.

Now the code looks like this:

query_ka_name = reduce(operator.and_, (Q(company__iregex=tambourine(item)) for item in ka_name_listed))

def tambourine(string):
    string = re.sub(r'и|й', '(и|й)', string, flags=re.IGNORECASE)
    output = ''
    for char in string:
        if char.isalpha():
            output = '{o}({u}|{l})'.format(o=output, u=char.upper(), l=char.lower())
        else:
            output = '{o}{c}'.format(o=output, c=char)
    output = r'(([^\w]|^){i}([^\w]|$))'.format(i=output)
    return output

That is probably slow as hell in comparison, but at least it works. I would still greatly appreciate a solution to the problem.
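To see what pattern this workaround produces, here is a self-contained sketch of the same idea. The name `build_case_explicit_pattern` is made up for illustration, and the check at the end uses Python's own `re` module as a stand-in, so it only shows the shape of the pattern and says nothing about how MySQL evaluates it.

```python
import re


def build_case_explicit_pattern(word):
    """Spell out both cases of every letter so no case folding is needed."""
    # Treat и and й as interchangeable, as in the original code.
    word = re.sub(r'и|й', '(и|й)', word, flags=re.IGNORECASE)
    parts = []
    for char in word:
        if char.isalpha():
            # Alternation, not a bracket class: each letter is several bytes in UTF-8.
            parts.append('({u}|{l})'.format(u=char.upper(), l=char.lower()))
        else:
            parts.append(char)
    # Require a non-word character or a string boundary on both sides.
    return r'(([^\w]|^){body}([^\w]|$))'.format(body=''.join(parts))


pattern = build_case_explicit_pattern('ситистрой')
print(pattern)
# Matched WITHOUT re.IGNORECASE on purpose: the pattern carries the case variants itself.
print(bool(re.search(pattern, 'ООО "СитиСтрой"')))
```

Alternation is used rather than a bracket expression such as [Сс] because a byte-oriented regex engine would read the brackets as a set of single bytes, not as two letters.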

Answer 1

"LATIN SMALL LETTER C" is not considered to be the same as "CYRILLIC SMALL LETTER ES".
Ditto for "CYRILLIC SMALL LETTER I" and "CYRILLIC SMALL LETTER SHORT I".
MySQL's REGEXP works with bytes, not characters, hence only unaccented English letters work in REGEXP; no Cyrillic letter can (reliably) work.
MariaDB 10.0.5's REGEXP should do a better job. Ref: https://mariadb.com/kb/en/mariadb/pcre/
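The byte-versus-character point can be illustrated outside the database. The sketch below uses Python's `re` module as an analogy only: with a bytes pattern, `re.IGNORECASE` folds ASCII letters and nothing else, which is roughly the behaviour described above. It is not MySQL's regex engine, and the variable names are made up for illustration.

```python
import re

value = 'ООО "СитиСтрой"'
upper_es = '\u0421'  # CYRILLIC CAPITAL LETTER ES
lower_es = '\u0441'  # CYRILLIC SMALL LETTER ES

# The two cases of the letter are unrelated two-byte sequences in UTF-8.
print(upper_es.encode('utf-8'), lower_es.encode('utf-8'))

# Character-aware matching: case folding knows about Cyrillic.
char_match = re.search(lower_es, value, flags=re.IGNORECASE)

# Byte-oriented matching: IGNORECASE on bytes only folds ASCII letters,
# so the lowercase byte sequence is never found in the encoded value.
byte_match = re.search(lower_es.encode('utf-8'), value.encode('utf-8'),
                       flags=re.IGNORECASE)

print(char_match is not None, byte_match is not None)
```

Under this analogy the lowercase pattern fails for the same reason the lowercase REGEXP query in the question does: nothing maps the bytes of one case onto the bytes of the other.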

Answer 2

I suggest switching to a Postgres database, which handles non-Latin symbols pretty well.

I just tried to reproduce your issue on my Django 1.10 and Postgres 9.6 setup.

from django.contrib.auth.models import User
users = User.objects.filter(username__iregex='Сосницки(и|й)')

users <QuerySet [<User: Сосницкий>, <User: сосницкий>, <User: сосницкии>, <User: СоСницкии>]>

Seems to be working.
