
Removing Unicode characters from a String in Java using regex

I have an input string like the one below.

String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

I want to remove Unicode characters such as "\u2028" and "\u2019" if they are present in the comment. At runtime I don't know which extra characters will arrive. So what is the best way to handle this?

I tried the following, which removes Unicode characters from the given string.

comment = comment.replaceAll("\\P{Print}", "");

So what is the best way to check whether such Unicode characters are present in the comment, remove them if they are, and otherwise pass the comment to the target system unchanged?

Can anyone please help me resolve this?

You can do this sequentially, as shown below:

public static void main(String[] args) {
    String comment = "Good morning! \u2028\u2028I am looking to purchase a new Honda car as I\u2019m outgrowing my current car. I currently drive a Hyundai Accent and I was looking for something a little bit larger and more comfortable like the Honda Civic. May I know if you have any of the models currently in stock? Thank you! Warm regards Sandra";

    // Step 1: remove every non-ASCII character (this drops \u2028 and \u2019)
    comment = comment.replaceAll("[^\\x00-\\x7F]", "");

    // Step 2: remove ASCII control characters, except CR, LF and TAB
    comment = comment.replaceAll("[\\p{Cntrl}&&[^\r\n\t]]", "");

    // Step 3: remove everything in Unicode category C (control, format, surrogate, private-use, unassigned).
    // Note: \p{C} also matches CR, LF and TAB, so skip this step if line breaks must be kept.
    comment = comment.replaceAll("\\p{C}", "");
    System.out.println(comment);
}
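The question also asks to strip characters only when they are present and to pass the comment through unchanged otherwise. A minimal sketch of that check follows. It assumes that "unwanted" means anything outside printable ASCII plus CR, LF and TAB; the class name `CommentSanitizer` and the method name `sanitize` are hypothetical and do not come from the original post.

```java
import java.util.regex.Pattern;

public class CommentSanitizer {
    // Assumption: keep printable ASCII (space through tilde) plus CR, LF and TAB; match everything else.
    private static final Pattern UNWANTED = Pattern.compile("[^\\x20-\\x7E\\r\\n\\t]");

    public static String sanitize(String comment) {
        // Nothing matched: hand the original string back unchanged.
        if (!UNWANTED.matcher(comment).find()) {
            return comment;
        }
        // At least one unwanted character is present: delete all of them.
        return UNWANTED.matcher(comment).replaceAll("");
    }

    public static void main(String[] args) {
        String comment = "Good morning! \u2028\u2028I\u2019m outgrowing my current car.";
        // prints "Good morning! Im outgrowing my current car."
        System.out.println(sanitize(comment));
    }
}
```

Compiling the pattern once avoids recompiling the regex on every call, which `String.replaceAll` would otherwise do.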

If you use replace, you will lose some characters. For example, I'm will become Im. So the best approach is to convert the characters instead of removing them.

You can convert Unicode to UTF-8.

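// requires: import java.nio.charset.StandardCharsets;
// Note: a Java String is already Unicode, so for well-formed text this round trip returns an equal string.
byte[] byteComment = comment.getBytes(StandardCharsets.UTF_8);

String formattedComment = new String(byteComment, StandardCharsets.UTF_8);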
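If the goal is to keep I'm readable rather than ending up with Im, one option is to map common typographic characters to ASCII equivalents before stripping whatever remains. The sketch below is only an illustration of that idea. The mapping table (curly quotes to straight quotes, line and paragraph separators to a space) is an assumption, and the class name `CommentNormalizer` and method name `toAscii` are hypothetical.

```java
public class CommentNormalizer {
    public static String toAscii(String comment) {
        return comment
                // Curly single quotes become a plain apostrophe.
                .replaceAll("[\\u2018\\u2019]", "'")
                // Curly double quotes become a straight double quote.
                .replaceAll("[\\u201C\\u201D]", "\"")
                // Unicode line and paragraph separators become a space.
                .replaceAll("[\\u2028\\u2029]", " ")
                // Anything still outside printable ASCII, CR, LF and TAB is dropped.
                .replaceAll("[^\\x20-\\x7E\\r\\n\\t]", "");
    }

    public static void main(String[] args) {
        // prints "I'm outgrowing my current car."
        System.out.println(toAscii("I\u2019m outgrowing my current car."));
    }
}
```

The mapping can be extended with other characters that show up in real comments, such as dashes or non-breaking spaces.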
