Remove non-printable UTF-8 characters except control characters from a String

I've got a String containing text, control characters, digits, umlauts (German) and other UTF-8 characters.

I want to strip all UTF-8 characters that are not "part of the language". Special characters like (non-exhaustive list) ":/\ßä,;\n \t" should all be preserved.

Sadly, Stack Overflow removes all those characters, so I have to attach a picture (link).

Any ideas? Help is much appreciated!

PS: If anybody knows a pasting service that does not kill those special characters, I would happily upload the strings. I just wasn't able to find one.

[Edit]: I THINK the regex "\P{Cc}" matches all the characters I want to PRESERVE. Could this regex be inverted so that all characters not matching it are returned?

You have already found Unicode character properties.

You can invert the character property by changing the case of the leading "p".

E.g.:

\p{L} matches all letters.

\P{L} matches all characters that do not have the letter property.

So if you think \P{Cc} is what you need, then \p{Cc} would match the opposite.
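
As a small illustration of the lowercase/uppercase inversion, here is a minimal Java sketch. The class name `PropertyCaseDemo`, its method names, and the sample string are made up for this example; note that inside a Java string literal the backslash has to be doubled.

```java
class PropertyCaseDemo {
    // Removes everything that is NOT a letter (\P{L}), so only letters remain.
    static String keepLetters(String input) {
        return input.replaceAll("\\P{L}", "");
    }

    // Removes everything that IS a letter (\p{L}), so only non-letters remain.
    static String dropLetters(String input) {
        return input.replaceAll("\\p{L}", "");
    }

    public static void main(String[] args) {
        // "Grüße 42!" written with Unicode escapes to avoid source-encoding issues.
        String sample = "Gr\u00fc\u00dfe 42!";
        System.out.println(keepLetters(sample));
        System.out.println(dropLetters(sample));
    }
}
```

The two methods differ only in the case of the "p", and between them they split the input into its letters and its non-letters.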

More details on regular-expressions.info.

I am quite sure \p{Cc} is close to what you want, but be careful: it also includes, e.g., the tab (0x09), the line feed (0x0A), and the carriage return (0x0D).

But you can create your own character class, like this:

[^\P{Cc}\t\r\n]

The class [^...] is a negated character class, so it matches everything that is not a "non-control character" (a double negation, so it matches control characters) and that is also not a tab, CR, or LF.
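
A minimal Java sketch of how that class could be applied, assuming the goal is to delete control characters while keeping tab, CR, and LF. The class name `ControlCharStripper` and the method `strip` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

class ControlCharStripper {
    // Control characters (Cc) other than tab, carriage return and line feed.
    // In a Java string literal every regex backslash is doubled.
    private static final Pattern UNWANTED = Pattern.compile("[^\\P{Cc}\\t\\r\\n]");

    // Returns a copy of input with the unwanted control characters removed.
    static String strip(String input) {
        return UNWANTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL (U+0000) and BEL (U+0007) are control characters that should go;
        // the tab, CR, LF and the umlaut (U+00E4) should survive.
        String dirty = "a\u0000b\tc\u0007\r\nd\u00e4";
        System.out.println(strip(dirty).equals("ab\tc\r\nd\u00e4"));
    }
}
```

Compiling the pattern once and reusing it avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.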

You can use:

your_string.replaceAll("\\p{C}", "");
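
Two things are worth keeping in mind with this one-liner. `replaceAll` returns a new String rather than modifying the original, so the result has to be assigned. Also, \p{C} is the whole Unicode "Other" category, which includes the control characters (Cc) and format characters (Cf), so tab, CR, and LF are removed as well. The sketch below contrasts the plain pattern with one possible variant that uses Java's character-class intersection (`&&`) to spare those three characters; the class and method names are invented for this example.

```java
import java.util.regex.Pattern;

class OtherCategoryDemo {
    // Everything in the Unicode "Other" category, including tab, CR and LF.
    private static final Pattern ALL_OTHER = Pattern.compile("\\p{C}");

    // "Other" characters intersected with "not tab, CR or LF".
    private static final Pattern OTHER_EXCEPT_LINE_CONTROLS =
            Pattern.compile("[\\p{C}&&[^\\t\\r\\n]]");

    static String stripAllOther(String input) {
        return ALL_OTHER.matcher(input).replaceAll("");
    }

    static String stripOtherKeepingLineControls(String input) {
        return OTHER_EXCEPT_LINE_CONTROLS.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Contains a tab, a zero-width space (U+200B, category Cf) and NUL (Cc).
        String sample = "a\tb\u200Bc\u0000";
        System.out.println(stripAllOther(sample).equals("abc"));
        System.out.println(stripOtherKeepingLineControls(sample).equals("a\tbc"));
    }
}
```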
