
Remove all except the Chinese characters with regex?

I have a string that is a sentence written in Chinese.

It contains Chinese characters plus other filler, such as spaces, commas, and exclamation marks, all encoded in UTF-8.

With a latin1 string, I could use preg_replace with [a-zA-Z] to clean it and remove the filler.

How can I keep only the Chinese "alphabet" characters in the Chinese string while removing all the filler?

According to this document, here are the Unicode ranges of Chinese characters:

Table 12-2. Blocks Containing Han Ideographs

Block                                     Range         Comment
CJK Unified Ideographs                    4E00–9FFF     Common
CJK Unified Ideographs Extension A        3400–4DBF     Rare
CJK Unified Ideographs Extension B        20000–2A6DF   Rare, historic
CJK Unified Ideographs Extension C        2A700–2B73F   Rare, historic
CJK Unified Ideographs Extension D        2B740–2B81F   Uncommon, some in current use
CJK Compatibility Ideographs              F900–FAFF     Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement   2F800–2FA1F   Unifiable variants

You could use it like this:

preg_replace('/[^\x{4E00}-\x{9FFF}]+/u', '', $string);

or

preg_replace('/\P{Han}+/u', '', $string);

where \P is the negation of \p.

See here for all the Unicode scripts.
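For comparison, here is a rough sketch of the explicit-range approach extended to every block in Table 12-2. It is written in Python rather than PHP, because Python's built-in re module has no \p{Han} support and so the ranges have to be spelled out anyway. The names HAN_RANGES, NON_HAN and keep_han are made up for this example, and the ranges are simply copied from the table, so treat it as an illustration rather than a complete definition of "Chinese character".

```python
import re

# Code point ranges copied from Table 12-2 (inclusive, hexadecimal).
HAN_RANGES = [
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x3400, 0x4DBF),    # Extension A
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

# Build the body of a character class from the literal boundary characters.
_char_class = "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in HAN_RANGES)

# Negated class: matches runs of anything that is NOT in the ranges above.
NON_HAN = re.compile(f"[^{_char_class}]+")


def keep_han(text: str) -> str:
    """Remove every run of characters outside the ranges listed above."""
    return NON_HAN.sub("", text)


print(keep_han("我有一个字符串, written in Chinese!"))  # prints 我有一个字符串
```

Because Python strings are sequences of code points, the supplementary-plane blocks (Extension B and later) need no special handling here.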

Hope this helps.

str1 = Regex.Replace(str1, @"[^\u2E80-\u2FD5\u3190-\u319F\u3400-\u4DBF\u4E00-\u9FCC\uF900-\uFAAD]", "");
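Hard-coded ranges like these have to be updated by hand whenever Unicode assigns new ideographs. As an alternative sketch (in Python, and not taken from either answer), each character can instead be tested by its Unicode name using the standard unicodedata module. The function name keep_han_by_name and the choice of name prefixes are my own assumptions: this version keeps unified and compatibility ideographs only, so it drops the radicals and Kanbun marks that the ranges above include, and what it recognizes depends on the Unicode version bundled with your Python.

```python
import unicodedata

# Name prefixes treated as "Chinese characters" in this sketch (an assumption).
HAN_NAME_PREFIXES = ("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")


def keep_han_by_name(text: str) -> str:
    """Keep only characters whose Unicode name starts with one of the prefixes."""
    return "".join(
        ch
        for ch in text
        # unicodedata.name returns the default "" for code points without a name.
        if unicodedata.name(ch, "").startswith(HAN_NAME_PREFIXES)
    )


print(keep_han_by_name("这包含中文字符, spaces and marks!"))  # prints 这包含中文字符
```

This is slower than a compiled regex on long strings, so it is probably better suited to checking what a given range list misses than to bulk cleaning.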
