I am working on removing or replacing the miscellaneous symbols in a string (in Java) that is used in a text area field in a web application. When I use the content below, it is converted into Unicode decimal representation values.
The content is: String a = "Last Search Results Bulletin Board Validations ⛔ 0 ⚡ 1 ⚠ 6? 0";
When I save that content in the text area (on a web page), the symbols are saved as &#9940;, &#9889;, &#9888;.
I want to remove the Unicode representation values, or save the content in the proper format, so that I have proper data to save into the database.
How do I remove the Unicode representation values for the symbols ('&#9940;' or '&#9889;' or '&#9888;') from a String? I tried a regular expression to replace those representations: s.replaceAll("&#[9728 - 9983];", " "). The range [9728 - 9983] is meant to cover the decimal values of the Miscellaneous Symbols Unicode block, but it does not replace them properly. Which regular expression can I use, or which other approach can I use to remove the values from a String?
(or)
How do I convert the Unicode representation values ('&#9940;' or '&#9889;' or '&#9888;') back into the same symbols (⛔, ⚡, ⚠) in the String?
I haven't found such a utility in stock Java. You'll just have to do it the 'hard' way.
Note that this doesn't cover the hexadecimal equivalents (e.g. &#x26D4;) or decimal values with lengths not equal to 4, and it only decodes code points from 9728 to 9999.
import java.text.NumberFormat;
import java.text.ParsePosition;

// Place this method inside any utility class.
public static String htmlCharsDecode(String string) {
    int length = string.length();
    StringBuilder out = new StringBuilder(length);
    NumberFormat parser = NumberFormat.getInstance();
    ParsePosition pos = new ParsePosition(0);
    for (int i = 0; i < length; i++) {
        char c = string.charAt(i);
        // Look for the fixed-width pattern "&#dddd;" starting at i.
        if (c == '&' && i < length - 6 && string.charAt(i + 1) == '#' && string.charAt(i + 6) == ';') {
            String codepointString = string.substring(i + 2, i + 6);
            pos.setIndex(0);
            Number value = parser.parse(codepointString, pos);
            // Accept only if all four characters were consumed as a number.
            boolean isDecimal = pos.getIndex() == codepointString.length();
            if (isDecimal) {
                int codepoint = value.intValue();
                // Decode only 9728..9999; every value in that range fits in a single char.
                if (codepoint >= 9728 && codepoint <= 9999) {
                    out.append((char) codepoint);
                    i += 6; // skip the rest of the entity; the loop's i++ then steps past ';'
                    continue;
                }
            }
        }
        out.append(c);
    }
    return out.toString();
}
You can make parser and pos global (static fields) to avoid creating new objects on each call, but watch out, as they are not thread-safe (and it's not good to optimize prematurely).
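For the cases the method above leaves out (hexadecimal references and decimal values of other lengths), a regex-based variant might look like the sketch below. This is not the code from the answer above: the class name NumericEntityDecoder and the method decode are hypothetical, and it assumes you want every numeric character reference decoded, not only the 9728..9999 range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name chosen for illustration only).
final class NumericEntityDecoder {
    // Group 1: decimal digits, as in &#9940;  Group 2: hex digits, as in &#x26D4;
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:(\\d{1,7})|[xX]([0-9a-fA-F]{1,6}));");

    static String decode(String input) {
        Matcher m = NUMERIC_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer(input.length());
        while (m.find()) {
            // The digit limits in the pattern keep both parses within int range.
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1))
                    : Integer.parseInt(m.group(2), 16);
            // Values beyond U+10FFFF are not code points; leave those references as-is.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixes decimal and hexadecimal references.
        System.out.println(decode("Validations &#9940; 0 &#9889; 1 &#x26A0; 6"));
    }
}
```

Named entities such as &amp;amp; are deliberately left untouched here; Character.toChars is used instead of a char cast so that code points above U+FFFF become a proper surrogate pair.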
You need to render the page in UTF-8 and declare in the form that the server accepts UTF-8 form data. (Otherwise &#...; entities are sent instead of Unicode symbols.)
<form action="..." accept-charset="UTF-8">
In HTML 5:
<meta charset="UTF-8">
Older HTML:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The response header should be set accordingly as well:
Content-Type: text/html; charset=UTF-8
response.setContentType("text/html; charset=UTF-8");
response.setCharacterEncoding("UTF-8");
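Why the entities show up at all can be checked from plain Java: when a form is submitted in an encoding that cannot represent a character, browsers typically substitute a numeric character reference for it. The sketch below only tests whether the three symbols from the question are encodable in each charset; the class name CharsetCheck is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (name chosen for illustration only).
final class CharsetCheck {
    // True if every character of the text can be represented in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String symbols = "\u26D4 \u26A1 \u26A0"; // the three symbols from the question
        System.out.println("ISO-8859-1: " + canEncode(symbols, StandardCharsets.ISO_8859_1)); // ISO-8859-1: false
        System.out.println("UTF-8: " + canEncode(symbols, StandardCharsets.UTF_8));           // UTF-8: true
    }
}
```

On the receiving side the request body also has to be decoded as UTF-8. In a servlet that is usually done with request.setCharacterEncoding("UTF-8") before the first parameter is read, assuming the container has not already been configured that way.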
There are a few libraries that can unescape HTML entities, e.g. jsoup's Parser.unescapeEntities() method.
If you simply want to remove emojis, take a look at this answer, which uses a white-list filter approach:
String input = "Last Validations ⛔ 0 ⚡ 1 ⚠ 6 ? 0";
String characterFilter = "[^\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{Cf}\\p{Cs}\\s]";
String emotionless = input.replaceAll(characterFilter,"");
System.out.println(emotionless); // Last Validations  0  1  6 ? 0 (the spaces around each removed symbol remain)
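If the goal is narrower, namely to drop only the Miscellaneous Symbols block that the question mentions (decimal 9728..9983, which is U+2600..U+26FF), a black-list variant could be sketched as follows. The class name MiscSymbolStripper is hypothetical, and collapsing the leftover spaces is an extra step assumed here, not something the white-list filter above does.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name chosen for illustration only).
final class MiscSymbolStripper {
    // U+2600..U+26FF is the Miscellaneous Symbols block (decimal 9728..9983).
    private static final Pattern MISC_SYMBOLS = Pattern.compile("[\\u2600-\\u26FF]");
    private static final Pattern SPACE_RUNS = Pattern.compile(" {2,}");

    static String strip(String input) {
        String withoutSymbols = MISC_SYMBOLS.matcher(input).replaceAll("");
        // Removing a symbol leaves the spaces on both sides of it; merge them into one.
        return SPACE_RUNS.matcher(withoutSymbols).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String input = "Last Validations \u26D4 0 \u26A1 1 \u26A0 6 ? 0";
        System.out.println(strip(input)); // Last Validations 0 1 6 ? 0
    }
}
```

This only works on strings that contain the actual symbols; text that still holds the &#...; form has to be decoded (or matched numerically) first.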