mysql query with regex unicode

I would like to write a MySQL query that matches: أرأء

The character أ may be typed in several ways: ( أ or إ or ا or آ )

So when I type:

$SQL=" select * from work where title REGEX '[\\u0622|\\u0623|\\u0625|\\u0627][\\u0631][\\u0622|\\u0623|\\u0625|\\u0627][\\u0621]" 

It doesn't work; I think the syntax is wrong.

MySQL does not have \\u escapes. Try to include the raw Unicode characters in the query string, and pass it to MySQL over a utf8 connection. How you might do that depends on what language and connector you are using to talk to MySQL. Best would be to pass the pattern string as a parameter from your language's native Unicode string type, if it has one; for example, in Python-MySQLdb I can just do:

chars = u'[أإاآ]'  # character class covering the four Alef variants
pattern = u'%sر%sء' % (chars, chars)
cursor = connection.cursor()
cursor.execute('SELECT * FROM work WHERE title REGEXP %s', [pattern])

(NB: no pipe characters are needed in a regex character group; inside [...] a | is just a literal pipe.)
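Since the `\u` escapes belong to the client language rather than to MySQL, one way to keep the code points visible is to build the pattern on the Python side. The sketch below is only illustrative: `build_pattern` is a hypothetical helper, and Python's own `re` module stands in for MySQL as a sanity check, so it says nothing about how a particular MySQL version handles multi-byte characters inside a character class.

```python
import re

# Code points from the question: the four Alef variants, Reh, and Hamza.
ALEF_VARIANTS = [0x0622, 0x0623, 0x0625, 0x0627]
REH = 0x0631
HAMZA = 0x0621


def build_pattern():
    """Return the regex as a native Unicode string (hypothetical helper)."""
    alef_class = "[" + "".join(chr(cp) for cp in ALEF_VARIANTS) + "]"
    return alef_class + chr(REH) + alef_class + chr(HAMZA)


pattern = build_pattern()

# Local sanity check with Python's regex engine, not MySQL's.
assert re.search(pattern, "\u0623\u0631\u0623\u0621")      # spelling from the question
assert re.search(pattern, "\u0627\u0631\u0622\u0621")      # other Alef variants
assert not re.search(pattern, "\u0628\u0631\u0623\u0621")  # starts with Beh instead
```

The resulting `pattern` string would then be passed as the query parameter, as in the snippet above.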

If you really can't get Unicode down your connection at all, MySQL does have a non-standard binary string escape which you could use to get the characters in through another encoding:

WHERE title REGEXP CAST(0x5bd8a3d8a5d8a7d8a25dd8b15bd8a3d8a5d8a7d8a25dd8a1 AS CHAR CHARACTER SET utf8)  -- hex-encoded UTF-8 encoded string
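The hex literal is simply the UTF-8 bytes of the pattern string written as hex digits. A small sketch of how it could be derived and checked in Python (the variable names are illustrative, not part of the original answer):

```python
# The pattern with its Alef character classes, as a Unicode string.
pattern = "[\u0623\u0625\u0627\u0622]\u0631[\u0623\u0625\u0627\u0622]\u0621"

# Encode to UTF-8 and render each byte as two hex digits.
hex_literal = "0x" + pattern.encode("utf-8").hex()

# This reproduces the literal used in the query above.
assert hex_literal == "0x5bd8a3d8a5d8a7d8a25dd8b15bd8a3d8a5d8a7d8a25dd8a1"

# Decoding the hex digits again recovers the original pattern.
assert bytes.fromhex(hex_literal[2:]).decode("utf-8") == pattern
```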

Generally you want to avoid using REGEXP, because it means any index on the title column will be ineffective and a full table scan will be forced.

One alternative would be to do a WHERE title IN (...) with a list of all 16 possible strings (4 × 4 Alef combinations) that would match the expression.
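The 16 candidates come from four choices at each of the two Alef positions. A minimal sketch of generating them, together with a matching placeholder list, in Python; the `%s` placeholder style assumes a MySQLdb-like connector. Note that `IN` compares the whole title, whereas the regex matched the word anywhere inside it.

```python
from itertools import product

ALEFS = "\u0623\u0625\u0627\u0622"  # the four Alef variants
REH = "\u0631"
HAMZA = "\u0621"

# Every combination of Alef variants in the first and third positions.
candidates = [a1 + REH + a2 + HAMZA for a1, a2 in product(ALEFS, repeat=2)]

# One placeholder per candidate; the values are passed separately as parameters.
placeholders = ", ".join(["%s"] * len(candidates))
query = "SELECT * FROM work WHERE title IN (" + placeholders + ")"
# cursor.execute(query, candidates)  # assumes an open MySQLdb-style cursor
```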

(The most performant approach would be to use a database collation which already treats all four characters as equal. I'm not aware of a collation that matches that loosely, though.)

The utf8 encodings of those 4 variants of Alef are D8A3 D8A5 D8A7 D8A2. So,

WHERE HEX(title) REGEXP '^(..)*D8(A3|A5|A7|A2)'

will check for the presence of any of them.

The ^(..)* matches any number of pairs of characters (hex digits, in this case) at the beginning of title; the rest of the pattern then looks for any of those 2-byte utf8 codes. Consuming whole pairs keeps the match aligned to byte boundaries.
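To see why the alignment prefix matters, here is a small sketch using Python's `re` module as a stand-in for MySQL's REGEXP. `hex_of` is a hypothetical helper that mimics `HEX()` on a utf8 string, and the test string is contrived for illustration.

```python
import re


def hex_of(text):
    """Mimic MySQL's HEX() on a utf8 string: uppercase hex of the UTF-8 bytes."""
    return text.encode("utf-8").hex().upper()


ALIGNED = r"^(..)*D8(A3|A5|A7|A2)"
UNALIGNED = r"D8(A3|A5|A7|A2)"

# U+034A followed by "0" encodes to the bytes CD 8A 30, i.e. hex "CD8A30".
# The digits "D8A3" appear in it, but straddling byte boundaries: no Alef is present.
tricky = "\u034a0"

false_positive = re.search(UNALIGNED, hex_of(tricky)) is not None
aligned_result = re.search(ALIGNED, hex_of(tricky)) is not None

# Beh followed by a plain Alef: a genuine match two bytes into the string.
real_match = re.search(ALIGNED, hex_of("\u0628\u0627")) is not None
```

Without the prefix the contrived string is reported as containing an Alef; with it, only matches that start on a byte boundary are accepted.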

This might be what you are striving for:

$SQL = "select * from work
    where HEX(title)
        REGEXP '^(..)*D8(A2|A3|A5|A7)D8B1D8(A2|A3|A5|A7)D8A1'";

^(..)* is to skip over an even number of hex characters (to keep aligned).
D8(A2|A3|A5|A7) is the utf8 encoding for the 4 Alefs.
D8B1 is for Reh.
D8A1 is for Hamza.
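To try the whole pattern outside the database, the sketch below applies it to hex-encoded sample titles in Python. It assumes Python's `re` behaves like MySQL's REGEXP for this simple pattern; `matches_word` is a hypothetical helper and the sample titles are made up.

```python
import re

WORD_HEX = r"^(..)*D8(A2|A3|A5|A7)D8B1D8(A2|A3|A5|A7)D8A1"


def matches_word(title):
    """Apply the hex pattern to the uppercase hex of the title's UTF-8 bytes."""
    hexed = title.encode("utf-8").hex().upper()
    return re.search(WORD_HEX, hexed) is not None


# Made-up sample titles for illustration only.
samples = [
    "\u0623\u0631\u0623\u0621",        # the spelling from the question
    "\u0627\u0631\u0622\u0621",        # same word typed with other Alef variants
    "abc \u0625\u0631\u0627\u0621 x",  # the word embedded in a longer title
    "\u0623\u0631\u0628\u0621",        # Beh in third position: should not match
]
results = [matches_word(s) for s in samples]
```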
