
Understanding character encoding in typical Java web app

Some pseudocode:

String a = "A bunch of text"; //UTF-16
saveTextInDb(a); //Write to Oracle VARCHAR(15) column
String b = readTextFromDb(); //UTF-16
out.write(b); //Write to http response

When you save the Java String (UTF-16) to an Oracle VARCHAR(15) column, does Oracle also store it as UTF-16? Does the length of an Oracle VARCHAR refer to the number of Unicode characters (and not the number of bytes)?

When we write b to the ServletResponse, is it written as UTF-16, or is it converted by default to another encoding such as UTF-8?

Instead of UTF-16, think of the 'internal representation' of your string. A String in Java is a sequence of characters; you don't need to care which encoding is used internally. Encoding becomes relevant when you interact with the outside of the program. In your example, saveTextInDb, readTextFromDb and write do that. Every time you exchange strings with the outside, an encoding is used for the conversion.

saveTextInDb (and readTextFromDb) look like self-made methods; at least I don't know them. So you should look up which encoding these methods use. The write method of a Writer always produces bytes in the encoding associated with that Writer. If you get your Writer from an HttpServletResponse, the associated encoding is the one used for outputting the response (it is also sent in the headers).

response.setCharacterEncoding("UTF-8"); // must be called before getWriter()
Writer out = response.getWriter();

This code gives you, in out, a Writer that encodes strings as UTF-8. It is similar when you write to a file:

Writer fileout = new OutputStreamWriter(new FileOutputStream(myfile), "ISO-8859-1");
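To make the point about Writers concrete, here is a minimal sketch that sends the same String through Writers bound to different charsets and counts the bytes that come out. The class and method names (WriterEncodingDemo, encodeWithWriter) are made up for illustration, and an in-memory stream stands in for the file or HTTP response.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WriterEncodingDemo {

    // Writes text through a Writer bound to the given charset
    // and returns the bytes that reached the underlying stream.
    static byte[] encodeWithWriter(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory sink.
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // four characters, the last one is e-acute
        System.out.println(encodeWithWriter(text, StandardCharsets.UTF_8).length);      // 5
        System.out.println(encodeWithWriter(text, StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(encodeWithWriter(text, StandardCharsets.UTF_16BE).length);   // 8
    }
}
```

The String is identical in all three calls; only the charset attached to the Writer decides which bytes leave the program.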

If you access a DB, the framework you use should ensure a consistent exchange of strings with the database.

The ability of Oracle to store (and later retrieve) Unicode text relies only on the character set of the database (usually specified during database creation). Choosing AL32UTF8 as the character set is recommended for storing Unicode text in CHAR datatypes (including VARCHAR/VARCHAR2), because it gives you access to all Unicode code points without consuming as much storage space as wider encodings such as UTF-16 (AL16UTF16) or UTF-32.

Assuming this is done, it is the Oracle JDBC driver that is responsible for converting UTF-16 encoded data into AL32UTF8. This "automatic" conversion between encodings also happens when data is read from the database. To answer the query on the length of a VARCHAR: by default, the definition of a VARCHAR2 column in Oracle uses byte semantics. VARCHAR2(n) defines a column that can store n bytes (this default is controlled by the NLS_LENGTH_SEMANTICS parameter of the database); if you need to define the size in characters, use VARCHAR2(n CHAR).
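The difference between byte and character semantics can be estimated on the Java side before inserting. The sketch below assumes an AL32UTF8 database, approximates Oracle's byte length with Java's UTF-8 encoder, and counts characters as Unicode code points; the class and method names are hypothetical, and Oracle's own length checks remain the authority.

```java
import java.nio.charset.StandardCharsets;

class VarcharFitDemo {

    // Bytes the text would need when encoded as UTF-8 (assumed to match AL32UTF8).
    static int utf8ByteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of Unicode code points, not UTF-16 char units.
    static int codePointLength(String text) {
        return text.codePointCount(0, text.length());
    }

    // Rough check against VARCHAR2(n), i.e. byte semantics.
    static boolean fitsByteSemantics(String text, int n) {
        return utf8ByteLength(text) <= n;
    }

    // Rough check against VARCHAR2(n CHAR), i.e. character semantics.
    static boolean fitsCharSemantics(String text, int n) {
        return codePointLength(text) <= n;
    }

    public static void main(String[] args) {
        String ascii = "A bunch of text";             // 15 characters, 15 bytes
        String mixed = "Z\u00fcrich caf\u00e9 \u20ac5"; // 14 characters, 18 bytes in UTF-8

        System.out.println(fitsByteSemantics(ascii, 15)); // true
        System.out.println(fitsByteSemantics(mixed, 15)); // false
        System.out.println(fitsCharSemantics(mixed, 15)); // true
    }
}
```

So the pseudocode's "A bunch of text" fits a 15-byte column only because it is pure ASCII; a shorter string with accented letters or a euro sign can overflow the same column.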

The encoding of the data written to the ServletResponse object depends on the default character encoding, unless it is specified via the ServletResponse.setCharacterEncoding() or ServletResponse.setContentType() API calls. All in all, for a complete Unicode solution involving an Oracle database, one must have knowledge of:

  1. The encoding of the incoming data (i.e. the encoding of the data read via the ServletRequest object). This can be controlled by specifying the accepted encoding in the HTML forms via the accept-charset attribute. If the encoding is unknown, the application could attempt to set it to a known value via the ServletRequest.setCharacterEncoding() method. This method doesn't change the existing encoding of characters in the stream; it only tells the container how to decode them. If the input stream is in ISO-Latin1, specifying a different encoding will most likely result in wrongly decoded characters or an exception being thrown. Knowing the encoding is important, since the Java runtime libraries need to know the original encoding of the stream if its contents are to be treated as character primitives or Strings. This is required when you invoke ServletRequest.getParameter or similar methods that process the stream and return String objects. The decoding process creates characters in Java's internal encoding (UTF-16).
  2. The encoding of the data read from streams, as opposed to data created within the JVM. This is quite important, since the encoding of data read from streams cannot be changed. There is, however, a decoding process that converts characters in supported encodings to UTF-16 characters whenever such data is accessed as a character primitive or as a String. New String objects, on the other hand, can be created from bytes with a specified encoding. This matters when you write the contents of the stream out onto another stream (the HttpServletResponse object's output stream, for instance). If the contents of the input stream are treated as a sequence of bytes, and not as characters or Strings, then no decoding is undertaken by the JVM. This implies that the bytes written to the output stream must not be altered if intermediate character or String objects are not created. Otherwise, it is quite possible that the contents of the output stream will be malformed and parsed incorrectly by the corresponding decoder. In simpler words:

    • if one is writing String objects or characters to the servlet's output stream, then one must specify the encoding that the browser must use to decode the response. Appropriate encoders might be used to encode the sequence of characters as specified in the desired response.
    • if one is writing a sequence of bytes that will be interpreted as characters, then the encoding to be specified in the HTTP header must be known beforehand.
    • if one is writing a sequence of bytes that will be parsed as a sequence of bytes (for images and other binary data), then the concept of encoding is immaterial.
  3. The database character set of the Oracle instance. As indicated previously, data will be stored in the Oracle database in the defined character set (for CHAR datatypes). The Oracle JDBC driver takes care of converting data between UTF-16 and AL32UTF8 (the database character set in this case) for CHAR and NCHAR datatypes. When you invoke resultSet.getString(), the JDBC driver returns a String with UTF-16 characters. The converse is true when you send data to the database. If another database character set is used, an additional level of conversion (from UTF-16 to UTF-8 to the database character set) is performed transparently by the JDBC driver.
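Points 1 and 2 above come down to this: bytes carry no encoding of their own, and decoding them with the wrong charset silently produces different characters. A small sketch of that effect follows, using only the standard String and Charset APIs; the class name and the decode helper are illustrative and are not part of any servlet or JDBC API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeMismatchDemo {

    // Turns raw bytes into a String using the charset the caller claims they are in.
    static String decode(byte[] raw, Charset claimed) {
        return new String(raw, claimed);
    }

    public static void main(String[] args) {
        // Bytes as a client might send them: "café" encoded as UTF-8 (5 bytes).
        byte[] wire = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String right = decode(wire, StandardCharsets.UTF_8);      // 4 characters
        String wrong = decode(wire, StandardCharsets.ISO_8859_1); // 5 characters, mojibake

        System.out.println(right.equals("caf\u00e9"));       // true
        System.out.println(wrong.equals("caf\u00c3\u00a9")); // true: the two UTF-8 bytes became two characters

        // Copying the bytes without decoding leaves them exactly as received.
        byte[] copy = Arrays.copyOf(wire, wire.length);
        System.out.println(Arrays.equals(copy, wire));       // true
    }
}
```

No exception is thrown in the mismatched case here, which is why a wrong request encoding often shows up only later as corrupted text in the database or in the response.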

The ServletResponse will use ISO 8859-1 (Latin 1) by default. UTF-8 is the most common encoding used for HTTP responses that require Unicode, but you have to set that encoding explicitly.
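Why the default matters can be shown without a servlet container. The sketch below encodes a String and decodes it again with the same charset, using plain String.getBytes as a stand-in for the response Writer (an assumption; a real container's Writer may handle unmappable characters somewhat differently). The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1LossDemo {

    // Encodes text with the given charset and decodes the bytes again with the same charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String price = "\u20ac5"; // euro sign followed by the digit 5

        // ISO-8859-1 has no euro sign; getBytes substitutes '?' for it.
        System.out.println(roundTrip(price, StandardCharsets.ISO_8859_1).equals("?5")); // true

        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(price, StandardCharsets.UTF_8).equals(price));     // true
    }
}
```

Text that stays within Latin 1 survives the default, which is why the problem tends to appear only once non-Western characters reach the application.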

According to this document, Oracle can support either UTF-8 or UTF-16 in the database. Your methods that read from and write to Oracle will need to use the encoding that matches how the database is set up, and translate it to and from the Java internal representation.
