
Parsing a Content-Type header in Java without validating the charset

Given an HTTP header like:

Content-Type: text/plain; charset=something

I'd like to extract the MIME type and charset using full RFC-compliant parsing, but without "validating" the charset. By validating, I mean resolving the name through Java's internal Charset mechanism, which I want to avoid in case the charset is unknown to Java (but may still have meaning for other applications). The following code does not work because it performs this validation:

import java.nio.charset.Charset;

import org.apache.http.entity.ContentType;

String header = "text/plain; charset=something";

ContentType contentType = ContentType.parse(header);
Charset contentTypeCharset = contentType.getCharset();

System.out.println(contentType.getMimeType());
System.out.println(contentTypeCharset == null ? null : contentTypeCharset.toString());

This throws java.nio.charset.UnsupportedCharsetException: something.

To parse the header without any charset lookup, one can use the lower-level parsing classes:

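import org.apache.commons.lang3.StringUtils; // assumption: StringUtils comes from Apache Commons Lang 3
import org.apache.http.HeaderElement;
import org.apache.http.NameValuePair;
import org.apache.http.message.BasicHeaderValueParser;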

String header = "text/plain; charset=something";

// Passing null as the parser selects the default BasicHeaderValueParser instance.
HeaderElement headerElement = BasicHeaderValueParser.parseHeaderElement(header, null);
String mimeType = headerElement.getName();
String charset = null;
for (NameValuePair param : headerElement.getParameters()) {
    if (param.getName().equalsIgnoreCase("charset")) {
        String s = param.getValue();
        if (!StringUtils.isBlank(s)) {
            charset = s;
        }
        break;
    }
}

System.out.println(mimeType);
System.out.println(charset);
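
For comparison, the same idea can be sketched with the plain JDK, for cases where depending on HttpComponents is not an option. This is only a minimal sketch and not a full RFC-compliant parser: it assumes a single media type, splits parameters on semicolons outside double quotes, and unquotes the value. The class name ContentTypeSketch and its methods are hypothetical and not part of any library.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HttpComponents: splits a Content-Type value
// into MIME type and raw charset name without touching java.nio.charset.
final class ContentTypeSketch {

    /** Returns {mimeType, charsetOrNull}; the charset is returned as written, never validated. */
    static String[] parse(String headerValue) {
        List<String> parts = splitOnSemicolons(headerValue);
        String mimeType = parts.get(0).trim();
        String charset = null;
        for (int i = 1; i < parts.size(); i++) {
            String param = parts.get(i);
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value pair
            }
            String name = param.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = unquote(param.substring(eq + 1).trim());
                if (!value.trim().isEmpty()) {
                    charset = value;
                }
                break; // first charset parameter wins, as in the loop above
            }
        }
        return new String[] { mimeType, charset };
    }

    // Splits on ';' but ignores semicolons inside a double-quoted string.
    private static List<String> splitOnSemicolons(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (inQuotes && c == '\\' && i + 1 < s.length()) {
                current.append(c).append(s.charAt(++i)); // keep the escaped pair intact
            } else if (c == '"') {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == ';' && !inQuotes) {
                parts.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        parts.add(current.toString());
        return parts;
    }

    // Strips surrounding double quotes and resolves backslash escapes.
    private static String unquote(String v) {
        if (v.length() < 2 || v.charAt(0) != '"' || v.charAt(v.length() - 1) != '"') {
            return v;
        }
        StringBuilder out = new StringBuilder();
        for (int i = 1; i < v.length() - 1; i++) {
            char c = v.charAt(i);
            if (c == '\\' && i + 1 < v.length() - 1) {
                c = v.charAt(++i);
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] result = parse("text/plain; charset=something");
        System.out.println(result[0]);
        System.out.println(result[1]);
    }
}
```

Run as is, this prints text/plain and then something. Because the charset stays a plain String, an unknown name never triggers a lookup; the trade-off is that edge cases the HttpComponents parser handles (comments, multiple elements, malformed input) are not covered here.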

Alternatively, one can still use Apache's ContentType.parse and catch the UnsupportedCharsetException, extracting the charset name from the exception with getCharsetName():

import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

import org.apache.http.entity.ContentType;

String header = "text/plain; charset=something";

String charsetName;
String mimeType;

try {
    ContentType contentType = ContentType.parse(header); // may throw UnsupportedCharsetException
    mimeType = contentType.getMimeType();
    Charset charset = contentType.getCharset();
    charsetName = charset != null ? charset.name() : null;
} catch (UnsupportedCharsetException e) {
    charsetName = e.getCharsetName(); // the name Java did not recognise
    // No ContentType object exists here, so the MIME type is cut out of the raw header.
    mimeType = header.substring(0, header.indexOf(';')).trim();
}

The drawback is that the MIME type has to be extracted separately when an UnsupportedCharsetException is thrown.
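
The exception-based approach relies on UnsupportedCharsetException carrying the rejected name. The sketch below isolates that mechanism using only the JDK, on the assumption that ContentType.parse resolves the charset through a lookup equivalent to Charset.forName. The class name CharsetNameSketch and the method nameOf are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper illustrating how the charset name survives a failed lookup.
final class CharsetNameSketch {

    /** Returns the canonical name if the JVM knows the charset, otherwise the name as requested. */
    static String nameOf(String requested) {
        try {
            return Charset.forName(requested).name();
        } catch (UnsupportedCharsetException e) {
            return e.getCharsetName(); // the name that could not be resolved
        }
    }

    public static void main(String[] args) {
        System.out.println(nameOf("utf-8"));     // known charset: canonical name
        System.out.println(nameOf("something")); // unknown charset: name is returned unchanged
    }
}
```

Two details are worth noting. A known charset comes back under its canonical name (utf-8 becomes UTF-8), so the result is not always the literal header text. Also, Charset.forName reports syntactically illegal names through a different exception, IllegalCharsetNameException, which this sketch does not catch.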
