
URL encoding for Latin characters in Java

I'm trying to read an image from a URL. As mentioned in the Java documentation, I tried converting the URL to a URI like this:

String imageURL = "http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg";
URL url = new URL(imageURL);
url = new URI(url.getProtocol(), url.getHost(), url.getFile(), null).toURL();  
URLConnection conn = url.openConnection();
InputStream is = conn.getInputStream();

I get a java.io.FileNotFoundException for the file http://www.shefinds.com/files/Christian-Louboutin-DÃ©colletÃ©-100-pumps.jpg

What am I doing wrong, and what is the right way to encode this URL?

Update:
I'm using ROME to read in RSS feeds. Following suggestions from BalusC, I printed out the raw input at different stages, and it seems that the ROME RSS parser is using ISO-8859-1 instead of UTF-8.
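To illustrate the charset mix-up described in the update, here is a minimal sketch that decodes the same bytes twice, once with each charset. It does not use ROME at all; the class name `ExplicitCharsetRead` and the helper `readAll` are hypothetical, and the byte array merely stands in for the raw feed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of ROME: decodes a byte stream with an explicit charset.
public class ExplicitCharsetRead {

    // Reads the whole stream into a String using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[1024];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the raw feed bytes, assumed to be UTF-8 encoded.
        byte[] feedBytes = "D\u00e9collet\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(feedBytes), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(feedBytes), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 9: each accented letter is one char
        System.out.println(wrong.length()); // 11: each accented letter became two chars
    }
}
```

If the decoded text is longer than expected, with two characters where an accented letter should be, the wrong charset was most likely applied at read time.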

Works fine here (it returns a 403, which is at least not a 404):

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
InputStream input = connection.getInputStream();

When I fix it so that it doesn't return a 403, the picture is correctly retrieved:

URL url = new URL("http://www.shefinds.com/files/Christian-Louboutin-Décolleté-100-pumps.jpg");
URLConnection connection = url.openConnection();
connection.setRequestProperty("User-Agent", "Mozilla/4.0");
InputStream input = connection.getInputStream();
OutputStream output = new FileOutputStream("/pic.jpg");
// Copy the response body to the file one byte at a time.
for (int data; (data = input.read()) != -1;) {
    output.write(data);
}
output.close();
input.close();

So your problem lies somewhere else. The conversion is actually not needed. The initial URL is valid.

Maybe you're obtaining the actual URL from some binary source using the wrong character encoding? The transition of é to Ã© suggests that the original source was UTF-8 encoded and that the code incorrectly read it using ISO-8859-1 instead of UTF-8.
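The byte-level mechanics behind that transition can be reproduced in a few lines. This is only a sketch: the class `MojibakeDemo` and its methods `misread` and `repair` are made-up names for illustration, and the accented letter is written as a Unicode escape so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how UTF-8 bytes read as ISO-8859-1 turn one char into two.
public class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes as ISO-8859-1 (the faulty read).
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes the faulty read: valid only if every garbled char still maps to its original byte.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";          // e-acute, UTF-8 bytes C3 A9
        String garbled = misread(original);  // U+00C3 followed by U+00A9

        System.out.println(garbled.equals("\u00c3\u00a9"));    // true
        System.out.println(repair(garbled).equals(original));  // true
    }
}
```

Repairing after the fact is a workaround at best; reading the source with the correct charset in the first place avoids the problem.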

Update: or maybe you've actually hardcoded it in the Java source code and saved the source file itself using the wrong encoding. I've configured my editor (Eclipse) to save files using UTF-8, and -Dfile.encoding also defaults to UTF-8, which would explain why it works on my machine ;)

Update 2: as per the comments, in a nutshell, everything should work fine if the encoding used to save the source file matches the default -Dfile.encoding of the runtime platform (and the character encoding in question supports the é). To avoid unforeseen clashes when you distribute the code, it's indeed better to replace hardcoded non-ASCII characters with Unicode escapes.
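As a small sketch of that advice, the URL from the question can be written with `\u00e9` in place of each é, which keeps the source file pure ASCII. The class name `EscapedUrl` and the helper `countNonAscii` are invented for this example.

```java
// Hypothetical example: the question's URL with its non-ASCII chars written as Unicode escapes.
public class EscapedUrl {

    // The compiler translates each escape to U+00E9 no matter how the source file is saved.
    static final String IMAGE_URL =
            "http://www.shefinds.com/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg";

    // Counts the chars outside the ASCII range, as a quick sanity check.
    static long countNonAscii(String text) {
        return text.chars().filter(c -> c > 127).count();
    }

    public static void main(String[] args) {
        System.out.println(countNonAscii(IMAGE_URL)); // 2
    }
}
```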

I think the technical answer is "you can't." Non-ASCII characters can't be used in a URL according to the standard, and even some ASCII characters must be escaped with "%XX" syntax, where XX is the hexadecimal value of the character's byte.

If anything, you can escape 'é' with '%E9', but this relies on the server interpreting it as an encoding of the character according to ISO-8859-1. While this isn't technically allowed, I believe many servers will do it.
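For comparison, here is a rough sketch of what the two escapings look like when produced with standard JDK classes. The class `PercentEncodingDemo` and its method names are made up for this example. Note that `URLEncoder` is designed for form data (it turns spaces into `+`), so it is used here only to show the bytes each charset produces, while `URI.toASCIIString()` escapes non-ASCII characters as UTF-8. Which form a given server accepts is still up to that server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo: percent-encoding the same character under two charsets.
public class PercentEncodingDemo {

    // Percent-encodes text using the named charset (form-style encoding).
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Builds a URI from its parts and returns it with non-ASCII chars escaped as UTF-8.
    static String toAsciiUrl(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode("\u00e9", "ISO-8859-1")); // %E9
        System.out.println(encode("\u00e9", "UTF-8"));      // %C3%A9
        System.out.println(toAsciiUrl("http", "www.shefinds.com",
                "/files/Christian-Louboutin-D\u00e9collet\u00e9-100-pumps.jpg"));
    }
}
```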

The encoding of your source file is to blame. Using your IDE, set it to UTF-8, and then re-paste the URL.
