Remove all except the Chinese characters with regex?
I have a string that is a sentence written in Chinese.
It contains Chinese characters plus other filler, such as spaces, commas, exclamation marks and so on, all encoded in UTF-8.
With a regex on a latin1 string, I could use preg_replace and [a-zA-Z] to clean it and remove the filler.
How can I keep only the Chinese "alphabet" characters in the Chinese string while removing all the filler items?
According to this document, here are the Unicode ranges of Chinese characters:
Table 12-2. Blocks Containing Han Ideographs
Block                                    Range          Comment
CJK Unified Ideographs                   4E00–9FFF      Common
CJK Unified Ideographs Extension A       3400–4DBF      Rare
CJK Unified Ideographs Extension B       20000–2A6DF    Rare, historic
CJK Unified Ideographs Extension C       2A700–2B73F    Rare, historic
CJK Unified Ideographs Extension D       2B740–2B81F    Uncommon, some in current use
CJK Compatibility Ideographs             F900–FAFF      Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement  2F800–2FA1F    Unifiable variants
You could use it like this:
preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);
or
preg_replace('/\P{Han}+/u', '', $string);
where \P is the negation of \p.
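As a quick check, here is a minimal sketch that runs both variants on a made-up sample string (the sample text and the variable names are assumptions for illustration). Note that PCRE, which backs preg_replace, writes code points as \x{XXXX} and does not accept a \uXXXX escape.

```php
<?php
// Sample input (made up for illustration): Han characters mixed with
// ASCII letters, digits, spaces, and ASCII and full-width punctuation.
$string = "你好, world! 这是一个测试。123";

// Variant 1: explicit range for the main CJK Unified Ideographs block.
// Everything outside U+4E00..U+9FFF is removed.
$byRange = preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

// Variant 2: Unicode script property. \P{Han} matches any character
// that does not belong to the Han script.
$byScript = preg_replace('/\P{Han}+/u', '', $string);

echo $byRange, "\n";   // 你好这是一个测试
echo $byScript, "\n";  // 你好这是一个测试
```

The two variants agree on this input, but they are not equivalent in general: \P{Han} also keeps Han characters from the extension blocks, while the explicit range keeps only the main block. With the /u modifier, preg_replace returns null if the subject is not valid UTF-8, so it may be worth checking the result.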
Hope this helps.
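// C#: the negated class removes every character outside the listed CJK ranges, keeping only the Chinese characters
str1 = Regex.Replace(str1, @"[^\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]", "");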
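The same explicit-range idea can be sketched in PHP using the blocks from Table 12-2, including those beyond U+FFFF. The function name keepHanIdeographs is hypothetical, chosen only for this illustration, and the block list is taken from the table as quoted rather than from a current Unicode release.

```php
<?php
// Hypothetical helper (the name is illustrative, not from the original answers).
// Keeps only characters that fall in the blocks listed in Table 12-2.
function keepHanIdeographs(string $text): string
{
    $pattern = '/[^'
        . '\x{4E00}-\x{9FFF}'    // CJK Unified Ideographs
        . '\x{3400}-\x{4DBF}'    // Extension A
        . '\x{20000}-\x{2A6DF}'  // Extension B
        . '\x{2A700}-\x{2B73F}'  // Extension C
        . '\x{2B740}-\x{2B81F}'  // Extension D
        . '\x{F900}-\x{FAFF}'    // CJK Compatibility Ideographs
        . '\x{2F800}-\x{2FA1F}'  // CJK Compatibility Ideographs Supplement
        . ']+/u';

    // preg_replace() returns null on failure, for example when $text
    // is not valid UTF-8 and the /u modifier is used.
    $result = preg_replace($pattern, '', $text);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed with error code ' . preg_last_error());
    }
    return $result;
}

// "\u{20000}" is the first code point of Extension B (PHP 7+ escape syntax).
echo keepHanIdeographs("\u{20000} and 中文!"), "\n";
```

Listing the blocks by hand keeps the set of accepted characters explicit, at the cost of having to update the list when new blocks are added; \p{Han} avoids that maintenance but accepts a slightly wider set of characters.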