
Reading the content of a web page

Hi, I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters. Any help, please? Here is my code:

String link = "some german link";

URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

You need to specify the character set for your InputStreamReader, like this:

InputStreamReader(url.openStream(), "UTF-8") 
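To illustrate why the charset argument matters, here is a minimal sketch that decodes the same bytes twice, once with the right charset and once with the wrong one. The class name `CharsetDemo` and the helper `readAll` are hypothetical, and an in-memory byte array stands in for `url.openStream()` so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CharsetDemo {

    // Hypothetical helper: reads the whole stream as text using the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Grüße" written with Unicode escapes so the source file encoding does not matter.
        String expected = "Gr\u00fc\u00dfe";
        // Stand-in for url.openStream(): the bytes a UTF-8 server would send.
        byte[] body = expected.getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(body), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(body), StandardCharsets.ISO_8859_1);

        System.out.println("decoded as UTF-8 matches:      " + right.equals(expected));
        System.out.println("decoded as ISO-8859-1 matches: " + wrong.equals(expected));
    }
}
```

Decoding with the wrong charset turns each two-byte UTF-8 sequence into two separate characters, which is the typical "strange characters" symptom.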

You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

The (X)HTML document may also declare its own encoding, which can differ from the one in the header; see HTML Character encodings.
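As a rough sketch of how the charset could be pulled out of such a header value, the snippet below parses the `charset` parameter and falls back to a default when it is missing or unknown. The class `ContentTypeCharset` and its `parse` method are hypothetical names. In real code the header value would typically come from `URLConnection.getContentType()`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class ContentTypeCharset {

    // Hypothetical helper: extracts the charset parameter from a Content-Type
    // header value, or returns the fallback if it is absent or not supported.
    static Charset parse(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                // Drop the "charset=" prefix and any surrounding quotes.
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(parse("text/html", StandardCharsets.UTF_8));
    }
}
```

The returned `Charset` can then be passed straight to the `InputStreamReader` constructor. This sketch does not look inside the document for a `<meta>` declaration.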

I can imagine that you will have to deal with many additional issues to parse a web page without errors. However, there are several HTTP client libraries available for Java, e.g. org.apache.httpcomponents. The code will look like this:

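import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try
{
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null)
  {
    // Decodes with the charset from the Content-Type header, or ISO-8859-1 if none is given.
    System.out.println(EntityUtils.toString(entity));
  }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}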

This is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>

Try setting a Charset:

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));

First, verify that the font you are using supports the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain to hunt for other causes when it's a simple "missing character" issue.

If that's not the issue, then either your input or your output is using the wrong character set. A character set determines how the numbers representing characters are mapped to characters, which the font then renders as glyphs (the pictures representing the characters). Java represents strings internally as UTF-16, and an InputStreamReader created without an explicit charset decodes the incoming bytes with the platform default charset, so the input stream is the first place to check.
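To rule out the output side as well, one option is to look at the bytes a `PrintStream` actually writes for a given encoding. The sketch below is only an illustration: `ConsoleEncodingDemo` and `printedBytes` are hypothetical names, and the in-memory sink stands in for the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ConsoleEncodingDemo {

    // Hypothetical helper: returns the bytes a PrintStream writes for the text
    // when it is configured with the named encoding.
    static byte[] printedBytes(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, charsetName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "Gr\u00fc\u00dfe"; // "Grüße"

        // The charset System.out falls back to depends on the platform.
        System.out.println("default charset: " + Charset.defaultCharset());

        // The same string becomes different bytes depending on the encoding.
        System.out.println("UTF-8 bytes:      " + printedBytes(text, "UTF-8").length);
        System.out.println("ISO-8859-1 bytes: " + printedBytes(text, "ISO-8859-1").length);

        // An encoding that cannot represent the characters replaces them.
        byte[] ascii = printedBytes(text, "US-ASCII");
        System.out.println("US-ASCII output:  " + new String(ascii, StandardCharsets.US_ASCII));

        // Wrapping System.out with an explicit encoding, assuming the terminal expects UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(text);
    }
}
```

If the terminal or IDE console expects a different encoding than the one the stream writes, correctly decoded text can still be displayed incorrectly.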

Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you reproduce them, please credit this site's URL or the original address.
