Database error while loading UTF-8 Encoded XML Data in Java

I would like to brainstorm my problem here; I am not sure whether this will get the question shut down or not!

Simplified: I have a system that reads an XML file and loads it into a database.

The XML has a schema with the following declaration:

<?xml version="1.0" encoding="UTF-8"?>

The culprit field has the following schema excerpt:

<xsd:simpleType name="title">
.....
<xsd:restriction base="xsd:string">
 <xsd:minLength value="1"/>
 <xsd:maxLength value="2000"/>
</xsd:restriction>

The schema is UTF-8 compliant, so it should support 2000 UTF-8 characters, whether they are single-byte, double-byte, or multi-byte.

The XML schema already does a character length check, as defined in the excerpt above.

The problem is that sometimes the XSD validation succeeds, but the database insert fails and crashes the server with a DB error when some multi-byte UTF-8 characters occur in the 'title' XML field.

The database 'title' column is defined as `varchar(2000)`

When the database insert operation fails, ops need to manually reduce the length of the XML field and re-process the XML file to fix it.

I have been researching:

  • byte vs character length check
  • schema validation
  • etc.

Could the solution be doing a string byte count check which matches the character count?

I can do a string.getBytes("UTF-8").length in Java, but how would that match the <xsd:maxLength value="2000"/> in the XSD and the varchar(2000)?

What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in the XSD, and that the XML data is successfully inserted into the DB as long as the XSD is conformed to?

Am I right in assuming a <xsd:maxLength value="2000"/> in the XSD matches the varchar(2000) column definition?

The schema is UTF-8 compliant

Not exactly, but I think I know what you mean. The XML declaration that you quoted is not specifying anything about the XML instance documents that match this schema. It is simply saying that the XSD itself (i.e. the XML document with root tag <xs:schema>) uses UTF-8 as its character encoding.

XML Schema never concerns itself with the raw bytes of the XML document. It is the XML infoset that is being validated. So the maxLength facet on the simple type is saying that you can have up to 2000 characters in this field. As you rightly point out, the actual length in bytes could easily exceed 2000, but the XML processor will not know or care.
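To make the character/byte distinction concrete, here is a minimal Java sketch. The class and method names are hypothetical, chosen only for illustration, and it assumes Java 11+ for String.repeat. It counts Unicode code points (the unit in which the maxLength facet is understood to measure xsd:string) alongside the size of the same value once encoded as UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class TitleLengthDemo {

    /** Number of Unicode code points in the string (the "character" count). */
    public static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Number of bytes the string occupies once encoded as UTF-8. */
    public static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "a".repeat(2000);         // 1 byte per character in UTF-8
        String accented = "\u00e9".repeat(2000); // e-acute: 2 bytes per character
        String cjk = "\u4e2d".repeat(2000);      // CJK ideograph: 3 bytes per character

        for (String title : new String[] {ascii, accented, cjk}) {
            System.out.println(characterCount(title) + " characters, "
                    + utf8ByteCount(title) + " UTF-8 bytes");
        }
    }
}
```

All three titles are exactly 2000 characters long and so satisfy maxLength, yet they occupy 2000, 4000 and 6000 bytes respectively when encoded as UTF-8.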

sometimes the XSD validates successfully, but the database insert fails

I agree with lunatikz - the most likely explanation is that the DB is incorrectly configured.

Could the solution be doing a string byte count check which matches the character count?

No, that would be fixing the wrong problem. The problem is probably in the database, not in your Java code.

What would you suggest as the best way to ensure the XML data for the title field does not exceed a specified length, as defined in XSD?

I don't think you need to do anything to ensure that. Your XML validator is already checking that for you, and it's probably working just fine.

And that the XML data is successfully inserted into the DB as long as XSD is conformed to?

Configure the DB or its table/column definition so that it stops trying to interpret the input using a single-byte character encoding.

Am I right in assuming a <xsd:maxLength value="2000"/> in the XSD matches the varchar(2000) column definition?

Yes, both are specifying a field with up to 2000 characters. But the database interprets the word 'character' in a different way from the XML processor.
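As a diagnostic rather than a fix, the following hypothetical Java sketch shows how the same title can pass a limit of 2000 counted in characters and fail a limit of 2000 counted in bytes. It assumes the column effectively measures its length in UTF-8 bytes; the question does not name the database or its length semantics, so that is only one possible explanation. The ColumnFitCheck class and LengthSemantics enum are invented for illustration, and String.repeat requires Java 11+.

```java
import java.nio.charset.StandardCharsets;

public class ColumnFitCheck {

    /** Hypothetical: the unit in which a column's declared length is enforced. */
    public enum LengthSemantics { CHARACTERS, BYTES }

    /** Returns true if the value fits within the limit under the given semantics. */
    public static boolean fits(String value, int limit, LengthSemantics semantics) {
        int size = (semantics == LengthSemantics.CHARACTERS)
                ? value.codePointCount(0, value.length())          // count code points
                : value.getBytes(StandardCharsets.UTF_8).length;   // count UTF-8 bytes
        return size <= limit;
    }

    public static void main(String[] args) {
        // 1500 two-byte characters: within maxLength=2000, but 3000 bytes in UTF-8.
        String title = "\u00e9".repeat(1500);

        System.out.println(fits(title, 2000, LengthSemantics.CHARACTERS)); // true
        System.out.println(fits(title, 2000, LengthSemantics.BYTES));      // false
    }
}
```

Logging both results for a rejected title would indicate whether the database is counting bytes, which in turn points at the column or encoding configuration rather than at the XSD.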
