Regex and ISO-8859-1 charset in Java

Question

I have some text encoded in ISO-8859-1, from which I extract data using regex.

The problem is that the strings I get from the Matcher object are garbled: characters like "ÅÄÖ" come out scrambled.

How do I stop the regex library from scrambling my characters?

Edit: Here's some code:

private HttpResponse sendGetRequest(String url) throws ClientProtocolException, IOException
{
    HttpGet get = new HttpGet(url);
    return hclient.execute(get);
}
private static String getResponseBody(HttpResponse response) throws IllegalStateException, IOException
{
    InputStream input = response.getEntity().getContent();
    StringBuilder builder = new StringBuilder();
    int read;
    byte[] tmp = new byte[1024];

    while ((read = input.read(tmp))!=-1)
    {
        builder.append(new String(tmp), 0,read-1);
    }

    return builder.toString();
}
HttpResponse response = sendGetRequest(url);
String html = getResponseBody(response);
Matcher matcher = forum_pattern.matcher(html);
while(matcher.find()) // do stuff

Answer 1

This is probably the immediate cause of your problem, and it's definitely an error:

builder.append(new String(tmp), 0, read-1);

When you call one of the new String(byte[]) constructors that doesn't take a Charset, it uses the platform default encoding. Apparently, the default encoding on your platform is not ISO-8859-1. You should be able to get the charset name from the response headers so you can supply it to the constructor.
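As a rough illustration of reading the charset from the response headers, here is a minimal sketch that pulls the charset parameter out of a Content-Type header value using only the JDK. The class name ContentTypeCharset and the helper charsetFrom are hypothetical, the fallback charset is an assumption you would choose yourself, and the sketch assumes Java 7 or later for StandardCharsets. How you obtain the header string from your HTTP client is left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ContentTypeCharset {
    // Hypothetical helper: returns the charset named in a Content-Type value
    // such as "text/html; charset=ISO-8859-1", or the fallback if none is usable.
    static Charset charsetFrom(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive check for the "charset=" parameter.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset cs = charsetFrom("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(cs);
    }
}
```

The returned Charset can then be passed to whatever decodes the bytes, instead of relying on the platform default.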

But you shouldn't be using a String constructor for this anyway; the proper way is to use an InputStreamReader. If the encoding were one of the multi-byte ones like UTF-8, you could easily corrupt the data because a chunk of bytes happened to end in the middle of a character.
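A sketch of what the InputStreamReader approach could look like for the getResponseBody method in the question. The class name ResponseBodyReader and the method readAll are hypothetical, and the sketch assumes Java 7 or later (try-with-resources, StandardCharsets). The reader decodes across chunk boundaries, so a multi-byte character split between two reads is not corrupted. Note also that the append call uses read as the exclusive end index; the read-1 in the original drops the last character of every chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ResponseBodyReader {
    // Decodes the whole stream with the given charset and returns it as a String.
    static String readAll(InputStream input, Charset charset) throws IOException {
        StringBuilder builder = new StringBuilder();
        char[] buffer = new char[1024];
        try (Reader reader = new InputStreamReader(input, charset)) {
            int read;
            while ((read = reader.read(buffer)) != -1) {
                // Append exactly the chars that were read in this pass.
                builder.append(buffer, 0, read);
            }
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the HTTP response body: "ÅÄÖ" encoded as ISO-8859-1.
        byte[] latin1 = "\u00C5\u00C4\u00D6".getBytes(StandardCharsets.ISO_8859_1);
        String html = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(html.length());
    }
}
```

In the question's code, the stream would come from response.getEntity().getContent() and the charset from the response headers.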

In any case, never, ever use a new String(byte[]) constructor or a String.getBytes() method that doesn't accept a Charset parameter. Those methods should be deprecated, and should emit ferocious warnings when anyone uses them.
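To make the difference concrete, here is a small sketch contrasting an explicit charset with a mismatched one. The class ExplicitCharsetDemo and its two helpers are hypothetical names, and the sketch assumes Java 7 or later for StandardCharsets; on older versions the overloads taking a charset name as a String play the same role.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Decode bytes that are known to be ISO-8859-1, independent of the platform default.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode a String as ISO-8859-1, independent of the platform default.
    static byte[] encodeLatin1(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The ISO-8859-1 bytes for "ÅÄÖ".
        byte[] wire = {(byte) 0xC5, (byte) 0xC4, (byte) 0xD6};

        String correct = decodeLatin1(wire);
        // Decoding the same bytes with the wrong charset does not give the same text.
        String wrong = new String(wire, StandardCharsets.UTF_8);

        System.out.println(correct.equals("\u00C5\u00C4\u00D6"));
        System.out.println(correct.equals(wrong));
    }
}
```

The charset-less new String(wire) behaves like one of these two calls depending on the machine it runs on, which is why the result differs between platforms.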

Answer 2

It's HTML from a website.

Use an HTML parser, and this problem and all potential future problems will disappear.

I can recommend Jsoup for the job.

solution1
3 ACCEPTED 2010-08-08 06:46:04

solution2
2 2010-08-07 21:11:10
