
Encoding issue while reading the content of an email with JavaMail

I'm reading messages from an email account using JavaMail 1.4.1 (I have since upgraded to 1.4.5, with the same result), and I'm having problems with the encoding of the content:

POP3Message pop3message;
... 
Object contentObject = pop3message.getContent();
...   
String contentType = pop3message.getContentType();
String content = contentObject.toString();

Some messages are read properly, but others contain strange characters because the wrong encoding is applied. I have found that it only fails for one specific content type.

It works well if the contentType is any of these:

  • text/plain; charset=ISO-8859-1

  • text/plain;
    charset="iso-8859-1"

  • text/plain;
    charset="ISO-8859-1";
    format="flowed"

  • text/plain; charset=windows-1252

but it doesn't if it is:

  • text/plain;
    charset="utf-8"

For this content type (the UTF-8 one), pop3message.getEncoding() returns:

quoted-printable

For these messages, the String value I see in the debugger (and in the database after persisting the object) looks like this, for example:

UbicaciÃ³n (instead of Ubicación)

If I open the email with the webmail client in a browser, it reads without any problem, and it is a normal message (no attachments, just text), so the message itself seems to be OK.

Any idea about how to solve this issue?

Thanks.


UPDATE: This is the code I added to try the getUTF8Content() function suggested by jlordo:

POP3Message pop3message = (POP3Message) message;
String uid = pop3folder.getUID(message);

//START JUST FOR TESTING PURPOSES
if(uid.trim().equals("1401")){
    Object utfContent = pop3message.getContent();
    System.out.println(utfContent.getClass().getName()); // it is of type String
    //System.out.println(utfContent); // if not commented out, it prints the content of one of the emails I'm having problems with.
    System.out.println(pop3message.getEncoding()); //prints: quoted-printable
    System.out.println(pop3message.getContentType()); //prints: text/plain; charset="utf-8"
    String utfContentString = getUTF8Content(utfContent); // throws java.lang.ClassCastException: java.lang.String cannot be cast to javax.mail.util.SharedByteArrayInputStream
    System.out.println(utfContentString);
}

//END TEST CODE

How are you detecting that these messages have "strange characters"? Are you displaying the data somewhere? It's possible that whatever method you're using to display the data isn't handling Unicode characters properly.

The first step is to determine whether you are getting the wrong characters, or whether the correct characters are being displayed incorrectly. You can examine the Unicode value of each character in the data (e.g., in the String returned from the getContent method) to make sure each character has the correct Unicode value. If it does, the problem is with the method you're using to display the characters.
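One way to carry out that check is to print the code points of the String. The following is a minimal sketch using only the standard library; the class and method names are made up for illustration. If the content was decoded correctly, "ó" shows up as the single value U+00F3. If UTF-8 bytes were decoded as ISO-8859-1, it shows up as the pair U+00C3 U+00B3.

```java
public class CodePointDump {

    // Returns one "U+XXXX" token per Unicode code point, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp); // skip both halves of a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the example does not depend on the source file encoding.
        String expected = "Ubicaci\u00f3n";      // correctly decoded text
        String suspect = "Ubicaci\u00c3\u00b3n"; // UTF-8 bytes read as ISO-8859-1
        System.out.println(dump(expected));
        System.out.println(dump(suspect));
    }
}
```

In the real program, the argument to dump would be the String obtained from getContent().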

Try this and let me know if it works:

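// contentType is the value returned by pop3message.getContentType()
if (contentType != null && contentType.toLowerCase().contains("utf-8")) {
    content = getUTF8Content(contentObject);
}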

import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.mail.util.SharedByteArrayInputStream;

// Throws ClassCastException if contentObject is not a SharedByteArrayInputStream
// (for example, when getContent() has already returned a decoded String).
public static String getUTF8Content(Object contentObject) throws IOException {
    SharedByteArrayInputStream sbais = (SharedByteArrayInputStream) contentObject;
    // Decode the raw bytes explicitly as UTF-8
    InputStreamReader isr = new InputStreamReader(sbais, Charset.forName("UTF-8"));
    StringBuilder content = new StringBuilder();
    char[] buffer = new char[1024];
    int charsRead;
    // read() may throw IOException
    while ((charsRead = isr.read(buffer)) != -1) {
        content.append(buffer, 0, charsRead);
    }
    return content.toString();
}

BTW, is JavaMail 1.4.1 a requirement? The current version is 1.4.5.
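Since the update in the question shows that getContent() can return a String rather than a stream, a more defensive variant could branch on the runtime type instead of casting. The sketch below is only an illustration with hypothetical names (ContentToText, toText, repairLatin1Mojibake) and uses only the standard library, so the content object is passed in as a plain Object. The repair method assumes the String was produced by decoding UTF-8 bytes as ISO-8859-1; if windows-1252 was involved instead, some characters may not survive the round trip.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ContentToText {

    // Accepts either an already decoded String or a raw InputStream of UTF-8 bytes.
    static String toText(Object content) {
        if (content instanceof String) {
            return (String) content;
        }
        if (content instanceof InputStream) {
            try {
                Reader reader = new InputStreamReader((InputStream) content, StandardCharsets.UTF_8);
                StringBuilder sb = new StringBuilder();
                char[] buffer = new char[1024];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    sb.append(buffer, 0, n);
                }
                return sb.toString();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        throw new IllegalArgumentException("Unsupported content type: " + content);
    }

    // Assumption: s holds UTF-8 bytes that were wrongly decoded as ISO-8859-1.
    // Re-encoding as ISO-8859-1 recovers the original bytes, which are then decoded as UTF-8.
    static String repairLatin1Mojibake(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

The repair step is a workaround for data that is already damaged; it does not fix whatever selected the wrong charset in the first place.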

What worked for me was calling getContentType() and checking whether the returned String contains "utf" (meaning the charset is one of the UTF encodings).

If it does, I treat the content differently:

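// Reads the whole stream as UTF-8. The "\\A" delimiter matches only the start of the input,
// so the Scanner returns everything as a single token.
private String encodeCorrectly(InputStream is) {
    java.util.Scanner s = new java.util.Scanner(is, StandardCharsets.UTF_8.name()).useDelimiter("\\A");
    return s.hasNext() ? s.next() : "";
}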

(A modification of an InputStream-to-String converter from this answer on SO.)

The important part here is using the correct Charset. This solved the issue for me.
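Rather than only testing for the substring "utf", the charset name can be pulled out of the Content-Type value and used directly. The following is a rough sketch with a hand-written regular expression and hypothetical names (CharsetFromContentType, charsetOf, decode); it is not JavaMail's own parser and does not cover every form the header can take. It assumes the bytes passed to decode have already had the transfer encoding (such as quoted-printable) removed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromContentType {

    // Matches charset=utf-8 as well as charset="utf-8", ignoring case.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the Content-Type value, or the fallback
    // when the parameter is missing or names a charset this JVM does not know.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Decodes raw body bytes with the declared charset, falling back to US-ASCII.
    static String decode(byte[] raw, String contentType) {
        return new String(raw, charsetOf(contentType, StandardCharsets.US_ASCII));
    }
}
```

For the failing header from the question, charsetOf("text/plain;\r\n    charset=\"utf-8\"", StandardCharsets.ISO_8859_1) resolves to UTF-8, while a header with no charset parameter yields the fallback.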

First of all, you must set the headers for UTF-8 encoding, like this:

...
MimeMessage msg = new MimeMessage(session);
msg.setHeader("Content-Type", "text/html; charset=UTF-8");
msg.setHeader("Content-Transfer-Encoding", "8bit");

msg.setFrom(new InternetAddress(doConversion(from)));
msg.setRecipients(javax.mail.Message.RecipientType.TO, address);
msg.setSubject(asunto, "UTF-8");

MimeBodyPart mbp1 = new MimeBodyPart();
mbp1.setContent(text, "text/html; charset=UTF-8");
Multipart mp = new MimeMultipart();
mp.addBodyPart(mbp1);
...

But for the 'from' header, I use the following method to convert the characters:

public String doConversion(String original) {
    if(original == null) return null;
    String converted = original.replaceAll("á", "\u00c3\u00a1");
    converted = converted.replaceAll("Á", "\u00c3\u0081");
    converted = converted.replaceAll("é", "\u00c3\u00a9");
    converted = converted.replaceAll("É", "\u00c3\u0089");
    converted = converted.replaceAll("í", "\u00c3\u00ad");
    converted = converted.replaceAll("Í", "\u00c3\u008d");
    converted = converted.replaceAll("ó", "\u00c3\u00b3");
    converted = converted.replaceAll("Ó", "\u00c3\u0093");
    converted = converted.replaceAll("ú", "\u00c3\u00ba");
    converted = converted.replaceAll("Ú", "\u00c3\u009a");
    converted = converted.replaceAll("ñ", "\u00c3\u00b1");
    converted = converted.replaceAll("Ñ", "\u00c3\u0091");
    converted = converted.replaceAll("€", "\u00e2\u0082\u00ac");
    converted = converted.replaceAll("¿", "\u00c2\u00bf");
    converted = converted.replaceAll("ª", "\u00c2\u00aa");
    converted = converted.replaceAll("º", "\u00c2\u00ba");
    return converted;
}

If you need to include other characters, you can look up their UTF-8 hex encoding at http://www.fileformat.info/info/charset/UTF-8/list.htm.
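The replacement pairs in doConversion follow one rule: each character is encoded as UTF-8, and every resulting byte is then written as the ISO-8859-1 character with the same value. Under that reading, the same conversion can be sketched for any character without a lookup table. The class and method names below are hypothetical, and the code uses only the standard library.

```java
import java.nio.charset.StandardCharsets;

public class Utf8AsLatin1 {

    // Encodes the text as UTF-8, then reinterprets each byte as one ISO-8859-1 character.
    static String utf8BytesAsLatin1(String original) {
        if (original == null) {
            return null;
        }
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "\u00e1" is á; the result is the two characters U+00C3 U+00A1.
        String converted = utf8BytesAsLatin1("\u00e1");
        for (char c : converted.toCharArray()) {
            System.out.println(String.format("U+%04X", (int) c));
        }
    }
}
```

This is still a workaround. JavaMail's InternetAddress also has a constructor that takes the address, a personal name, and a charset, which may make the manual conversion unnecessary for the display name.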
