
Reading the content of a web page

Hi, I want to read the content of a web page that contains German characters using Java. Unfortunately, the German characters appear as strange characters. Any help, please? Here is my code:

String link = "some german link";

URL url = new URL(link);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
    System.out.println(inputLine);
}

You need to specify the character set for your InputStreamReader, like

new InputStreamReader(url.openStream(), "UTF-8")
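As a fuller illustration of this suggestion, here is a minimal sketch that decodes a stream with an explicit charset. The class name `PageReader` and the helper `readAll` are made up for this example, and UTF-8 is only an assumption about the page; the real encoding has to match what the server actually sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class PageReader {

    // Decodes the whole stream with the given charset and returns
    // its lines, each terminated by '\n'.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // The URL is passed on the command line; UTF-8 is an assumption.
        try (InputStream in = URI.create(args[0]).toURL().openStream()) {
            System.out.print(readAll(in, StandardCharsets.UTF_8));
        }
    }
}
```

Passing a `Charset` object instead of a charset name avoids the checked `UnsupportedEncodingException` that the `String` overload declares.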

You have to set the correct encoding. You can find the encoding in the HTTP header:

Content-Type: text/html; charset=ISO-8859-1

This may be overridden in the (X)HTML document itself; see HTML character encodings.
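One way to act on the header is to pull the `charset` parameter out of the `Content-Type` value and fall back to a default when it is missing. The sketch below is an illustration only: the class `ContentTypeCharset` and the method `fromContentType` are hypothetical names, and the parsing is deliberately simple (it does not look at any `<meta>` declaration inside the document).

```java
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class, not part of the original answer.
class ContentTypeCharset {

    // Returns the charset named in a Content-Type header value such as
    // "text/html; charset=ISO-8859-1", or the fallback if the header is
    // null, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional double quotes around the value.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }
}
```

With `java.net.URLConnection`, the header value is available from `getContentType()`, so the result of `fromContentType(conn.getContentType(), StandardCharsets.UTF_8)` could be handed to the `InputStreamReader` constructor.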

I can imagine that you will have to deal with many additional issues to parse a web page without errors. There are several HTTP client libraries available for Java, e.g. org.apache.httpcomponents. The code will look like this:

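import java.io.IOException;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

DefaultHttpClient httpclient = new DefaultHttpClient();
HttpGet httpGet = new HttpGet("http://www.spiegel.de");

try
{
  HttpResponse response = httpclient.execute(httpGet);
  HttpEntity entity = response.getEntity();
  if (entity != null)
  {
    // Decodes with the charset from the response's Content-Type header,
    // falling back to ISO-8859-1 if none is given.
    System.out.println(EntityUtils.toString(entity));
  }
}
catch (ClientProtocolException e) {e.printStackTrace();}
catch (IOException e) {e.printStackTrace();}
finally
{
  // Release the connection resources held by the client.
  httpclient.getConnectionManager().shutdown();
}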

This is the Maven artifact:

<dependency>
  <groupId>org.apache.httpcomponents</groupId>
  <artifactId>httpclient</artifactId>
  <version>4.1.1</version>
  <type>jar</type>
  <scope>compile</scope>
</dependency>

Try setting a Charset:

new BufferedReader(new InputStreamReader(url.openStream(), Charset.forName("UTF-8")));

First, verify that the font you are using can support the particular German characters you are trying to display. Many fonts don't carry all characters, and it is a big pain looking for other reasons when it's a simple "missing character" issue.

If that's not the issue, then either your input or your output is using the wrong character set. A character set determines how the numbers representing characters get mapped to glyphs (the pictures that represent the characters). Java strings are Unicode (UTF-16) internally, so the damage happens where bytes are converted to or from characters. An InputStreamReader created without a charset decodes with the platform default, which may not match the page, so check the input stream first.
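To make the input-side failure concrete, the following sketch decodes the same UTF-8 bytes once with the right charset and once with the wrong one. The class `MojibakeDemo` and its `decode` method are hypothetical names for this illustration, and the sample word is written with Unicode escapes so that the source file's own encoding does not matter.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class MojibakeDemo {

    // Turns raw bytes into a String using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // German "greetings": G, r, u-umlaut, sharp s, e (5 characters).
        String original = "Gr\u00fc\u00dfe";

        // In UTF-8 the umlaut and the sharp s take two bytes each: 7 bytes.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String right = decode(utf8, StandardCharsets.UTF_8);
        // ISO-8859-1 maps every byte to one character, so the two-byte
        // sequences turn into pairs of unrelated characters.
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        // Print through a UTF-8 PrintStream so the output side is explicit
        // as well; the console must also be set to UTF-8 to display it.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(right + " has " + right.length() + " characters");
        out.println(wrong + " has " + wrong.length() + " characters");
    }
}
```

The wrongly decoded string has 7 characters instead of 5, which is the typical "strange characters" symptom described in the question.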
