What are these symbols that crash URLDecoder with UTF-8?
I'm using URLDecoder to decode a string:
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
URLDecoder.decode("%u6EDA%u52A8%u8F74%u627F", StandardCharsets.UTF_8.name());
This throws the following exception:
Exception in thread "main" java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - For input string: "u6"
at java.net.URLDecoder.decode(URLDecoder.java:194)
at Playground$.delayedEndpoint$Playground$1(Playground.scala:45)
at Playground$delayedInit$body.apply(Playground.scala:10)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at Playground$.main(Playground.scala:10)
at Playground.main(Playground.scala)
It seems that %u6 and %u8 are not allowed in the string. I've tried to read up on what these symbols are, but without success. I found the string in a dataset, in a field called "page title field", so I suspect they are encoded characters; I just don't know which encoding. Does anyone know what these symbols are and which encoding I should use to decode them successfully?
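Looks like a non-standard UTF-16-based encoding of "滚动轴承", which is Chinese for "rolling bearing". Each %uXXXX is one UTF-16 code unit written as four hex digits; this is the form produced by JavaScript's legacy escape() function, not standard percent-encoding.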
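To make the "UTF-16-based" reading concrete, here is a minimal sketch that goes in the opposite direction and writes every UTF-16 code unit of a string as %uXXXX. The object and method names (PercentUEncoder, encode) are made up for this illustration and are not part of any library.

```scala
object PercentUEncoder {
  // Hypothetical helper: renders each UTF-16 code unit (Char) as %u plus
  // four uppercase hex digits. In the f-interpolator, %% is a literal '%'.
  def encode(s: String): String =
    s.map(c => f"%%u${c.toInt}%04X").mkString
}

// PercentUEncoder.encode("滚动轴承") gives "%u6EDA%u52A8%u8F74%u627F",
// i.e. exactly the string from the question.
```

Note that this sketch escapes every character, including plain ASCII, which a real encoder producing this format would typically leave untouched.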
I'd suggest simply replacing every %u with \u, and then using StringEscapeUtils from Apache Commons:
import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import org.apache.commons.lang3.StringEscapeUtils
// replace (not replaceAll) substitutes literally, so the backslash in "\\u" survives
val unescapedJava = StringEscapeUtils.unescapeJava(str.replace("%u", "\\u"))
URLDecoder.decode(unescapedJava, StandardCharsets.UTF_8.name())
This should handle both kinds of escaping:
% followed by two hex digits is unaffected by the replacement and by unescapeJava, and is decoded by URLDecoder in the last step.
The odd %u escapes are treated specially (replaced by \u) and are eliminated in the first step.
If (and only if) you are absolutely certain that all code points were encoded this way, you can do without StringEscapeUtils:
new String(
"%u6EDA%u52A8%u8F74%u627F"
.replaceAll("%u", "")
.grouped(4)
.map(Integer.parseInt(_, 16).toChar)
.toArray
)
which produces
res: String = 滚动轴承
but I'd advise against it, because this method breaks down for inputs like "%u6EDA%u52A8%u8F74%u627Fcafebabe" that contain unescaped characters. Better to use a reliable library method that handles all corner cases.
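If pulling in Apache Commons is not an option, the mixed case can also be handled with the standard library alone. The following is only a sketch under stated assumptions: the name PercentUDecoder is hypothetical, a %u escape is assumed to always carry exactly four hex digits, and everything between %u escapes is handed to URLDecoder, which also turns + into a space.

```scala
import java.net.URLDecoder
import java.nio.charset.StandardCharsets
import scala.util.matching.Regex

object PercentUDecoder {
  // Assumption: a %u escape is always followed by exactly four hex digits.
  private val PercentU: Regex = "%u([0-9A-Fa-f]{4})".r

  private def urlDecode(s: String): String =
    URLDecoder.decode(s, StandardCharsets.UTF_8.name())

  def decode(input: String): String = {
    val out = new StringBuilder
    var last = 0
    for (m <- PercentU.findAllMatchIn(input)) {
      // Text between %u escapes goes through the regular URL decoder.
      out.append(urlDecode(input.substring(last, m.start)))
      // The four hex digits are one UTF-16 code unit.
      out.append(Integer.parseInt(m.group(1), 16).toChar)
      last = m.end
    }
    out.append(urlDecode(input.substring(last)))
    out.toString
  }
}

// PercentUDecoder.decode("%u6EDA%u52A8%u8F74%u627Fcafebabe") gives "滚动轴承cafebabe"
```

Decoding the segments separately means a character produced by a %u escape is never re-interpreted by URLDecoder. A malformed escape such as %uZZ is not matched by the pattern and still ends in an IllegalArgumentException from URLDecoder.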
Your string "%u6EDA%u52A8%u8F74%u627F" is syntactically invalid as a URL-encoded string. According to the javadoc of URLDecoder.decode and Wikipedia: Percent-encoding, every % must be followed by two hexadecimal digits.
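The rule can be checked up front before calling URLDecoder. This is a rough sketch, not a full validator: the name PercentEncodingCheck is hypothetical, and it only tests that every % starts a two-hex-digit escape, not that the decoded bytes form valid UTF-8.

```scala
object PercentEncodingCheck {
  private val ValidEscape = "%[0-9A-Fa-f]{2}".r

  // Remove every well-formed %XX escape; any '%' left over is malformed.
  def isWellFormed(s: String): Boolean =
    ValidEscape.replaceAllIn(s, "").indexOf('%') < 0
}

// PercentEncodingCheck.isWellFormed("%u6EDA%u52A8%u8F74%u627F") is false,
// because each % is followed by 'u' instead of a hex digit.
// PercentEncodingCheck.isWellFormed("%E6%BB%9A") is true.
```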
Maybe you meant to use "\u6EDA\u52A8\u8F74\u627F" instead. This would be a syntactically correct Java string (containing 4 hexadecimal Unicode escapes) and is equivalent to "滚动轴承". But it still wouldn't make sense to URL-decode this string. I therefore suspect that the error occurred on the encoding side, which produced this malformed URL-encoded string in the first place.