
Regex for matching Unicode pattern

I am trying to validate a file's content when it is uploaded, and I am stuck on Unicode encoding. I am not interested in finding special Unicode characters outside the ASCII range. I am trying to find out whether the file's content contains at least one Unicode escape pattern, such as \u0046.

For example, I reject any file that contains the word 'script', but what if the file contains this word written as Unicode escapes? Sure, Java decodes it into a normal string when it reads the content, but what if I can't rely on this?

As far as I have found on the Internet, Unicode characters are written like \u0046 or like U+0046. Based on this, I have written the following regex:

(\\u|U\+)....

This means \u or U+ followed by any four characters. This pattern does what I want, but I wonder whether there are other ways to write a Unicode character. Is it always \u or U+? Can there be more or fewer than four characters after \u or U+?

Thanks

The notation U+ followed by any number of hex digits belongs to the Unicode standard, where it names a code point; it is not functional anywhere in code. In Java source code and *.properties files, \u followed by four hex digits is an escape for one UTF-16 code unit, and it is parsed automatically.
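To illustrate what "automatically parsed" means, here is a minimal sketch. The class name `EscapeParsingDemo` is made up for this example. It contrasts an escape that the compiler translates with one that is kept as literal text by doubling the backslash, and shows that a code point above U+FFFF needs two escapes.

```java
class EscapeParsingDemo {
    public static void main(String[] args) {
        // The compiler translates the escape below into the single character F.
        String parsed = "\u0046";
        // Doubling the backslash keeps six literal characters: backslash, u, 0, 0, 4, 6.
        String literal = "\\u0046";

        System.out.println(parsed.equals("F")); // true
        System.out.println(parsed.length());    // 1
        System.out.println(literal.length());   // 6

        // Each escape is one UTF-16 code unit, so a code point above U+FFFF
        // is written as two escapes (a surrogate pair).
        String pair = "\uD83D\uDE00";
        System.out.println(pair.length());                         // 2
        System.out.println(pair.codePointCount(0, pair.length())); // 1
    }
}
```

Only the second form, the literal backslash followed by u, survives into the string at run time, so that is the form a content check on an uploaded file can actually see.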

The pattern to search for such an escape (written as a Java string literal):

"\\\\u[0-9A-Fa-f]{4}"

Or, more coarsely, a String.contains on:

"\\u"

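In languages other than Java, longer escape forms exist that cover the full Unicode range (code points up to U+10FFFF), for example \U followed by eight hex digits in C, C++ and Python, or \u{...} with up to six hex digits in JavaScript and Rust. Unfortunately, Java (at least up to Java 8) has no such form; a code point above U+FFFF is written as two \u escapes forming a surrogate pair.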
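Putting the two checks together, here is a rough sketch of how the detection could look, plus a simple decoding step so that a word such as 'script' is still found when it is written with escapes. The class and method names (`UnicodeEscapes`, `containsEscape`, `decode`) are hypothetical. The decoder is a simplification: unlike the Java compiler, it does not handle repeated u characters or check whether the backslash is itself escaped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeEscapes {
    // Regex: a literal backslash, then 'u', then exactly four hex digits (captured).
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // True if the content contains at least one well-formed four-digit escape.
    static boolean containsEscape(String content) {
        return ESCAPE.matcher(content).find();
    }

    // Replaces every four-digit escape with the UTF-16 code unit it denotes.
    static String decode(String content) {
        Matcher m = ESCAPE.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char unit = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against a decoded backslash or dollar sign.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(unit)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String disguised = "\\u0073\\u0063ript"; // escapes for 's' and 'c', then "ript"
        System.out.println(containsEscape(disguised));            // true
        System.out.println(decode(disguised).contains("script")); // true

        // A plain contains check is coarser: a Windows-style path also matches it.
        String path = "C:\\users";
        System.out.println(path.contains("\\u")); // true
        System.out.println(containsEscape(path)); // false

        // Two escapes forming a surrogate pair decode to one code point.
        String decoded = decode("\\uD83D\\uDE00");
        System.out.println(Integer.toHexString(decoded.codePointAt(0))); // 1f600
    }
}
```

Whether to reject on detection alone or to decode first and then run the existing word check depends on how strict the upload validation needs to be; the sketch shows both options.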

