Reading web page content

Question

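Hi, I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters show up as strange characters. Any help would be appreciated. Here is my code: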

String link = "some german link";

URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

Answer 1

You need to specify the character set for the InputStreamReader, for example:

InputStreamReader(url.openStream(), "UTF-8")
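As a minimal sketch, the full read loop with an explicit charset could look like the following. It assumes the page really is served as UTF-8; the class name `PageReader` and the helper `readAll` are made up for illustration, and the URL is taken from the command line rather than hard-coded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageReader {

    // Reads the whole stream, decoding bytes with the given charset
    // instead of the platform default. Hypothetical helper for illustration.
    public static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            // Wrapped so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PageReader <url>");
            return;
        }
        // Assumption: the server sends UTF-8; use the page's real charset otherwise.
        URL url = new URL(args[0]);
        System.out.print(readAll(url.openStream(), StandardCharsets.UTF_8));
    }
}
```

If the umlauts are still garbled with this, the page is probably not UTF-8, or the console that prints the text uses a different encoding.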

Answer 2

You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

This can be overridden inside the (X)HTML document itself; see HTML character encoding.
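A rough sketch of how the charset could be pulled out of such a header value is shown below. The class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, and the parsing is deliberately simple: it ignores any charset declared inside the HTML itself. With plain `java.net`, the header value would typically come from `URLConnection.getContentType()`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Returns the charset named in a Content-Type value,
    // or the fallback when none is present or the name is not supported.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Strip optional quotes around the charset name.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Header value taken from the answer above.
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        // No charset parameter: the fallback is returned.
        System.out.println(charsetOf("text/html", StandardCharsets.UTF_8));
    }
}
```

The returned `Charset` can then be passed to `InputStreamReader` in place of a hard-coded name.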

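I can imagine there are many additional issues you have to consider to parse a web page without errors. There are, however, several HTTP client libraries available for Java, e.g. org.apache.httpcomponents. The code would look like this: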

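import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try
{
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null)
  {
    // EntityUtils.toString decodes with the charset from the response's
    // Content-Type header and falls back to ISO-8859-1 if none is given.
    System.out.println(EntityUtils.toString(entity));
  }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}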

Here is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>

Answer 3

Try setting a charset.

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8") ));

Answer 4

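First, verify that the font you are using supports the specific German characters you are trying to display. Many fonts do not carry every character, and it is a real pain to hunt for other causes when the problem is simply a missing glyph.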

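If that is not the problem, then the character set of your input or output is wrong. A character set determines how the numbers that represent characters map to glyphs (the pictures that represent characters). Java uses Unicode internally (strings are stored as UTF-16), so the output stream is probably not the problem. Check the input stream.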
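To illustrate how a wrong input charset produces the "strange characters", here is a small sketch that encodes a German word as UTF-8 and then decodes the same bytes once correctly and once as ISO-8859-1. The class and method names are invented for this example, and Unicode escapes are used so the source compiles regardless of file encoding.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Decodes UTF-8 bytes as if they were ISO-8859-1:
    // each two-byte umlaut turns into two unrelated characters.
    public static String decodeWrong(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Decodes UTF-8 bytes as UTF-8, which round-trips the original text.
    public static String decodeRight(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // German greeting spelled with u-umlaut (U+00FC) and sharp s (U+00DF).
        String original = "Gr\u00fc\u00dfe";
        System.out.println("correct length: " + decodeRight(original).length());
        System.out.println("garbled length: " + decodeWrong(original).length());
        // What the console shows for these strings also depends on its own encoding.
        System.out.println(decodeRight(original));
        System.out.println(decodeWrong(original));
    }
}
```

The garbled string is longer than the original because every non-ASCII character occupied two bytes in UTF-8 and each byte was read as a separate character.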

4 answers

Answer 1
6 2011-05-31 14:16:42

Answer 2
2 accepted 2011-05-31 14:22:31

Answer 3
0 2011-05-31 14:17:03

Answer 4
0 2011-05-31 14:17:42
