
Java: Search in a wrongly encoded String without modifying it

I have to find a user-defined String in a document (using Java) that is stored in a database as a BLOB. When I search for a String containing special characters (umlauts such as ä, ö, ü), the search fails: it returns no positions at all. I am not allowed to convert the document's content to UTF-8 (which would fix this problem but raise a new, even bigger one).

Some additional information: the document's content is returned as a String decoded as "ISO-8859-1" (Latin-1). Here is an example of what such a String can look like:

Die Erkenntnis, daÃ der KÃ¼nstler Schutz braucht, ...

This is how it should look:

Die Erkenntnis, daß der Künstler Schutz braucht, ...

If I search for Künstler, the search fails to find it, because it looks for ü but the String only contains Ã¼.

Is it possible to convert Künstler into KÃ¼nstler so that I can search for the wrongly encoded version instead?

Note: We are using the Hibernate framework for database access. The original getter for the document's content returns a byte[]. The String is then created by calling

new String(getContent(), "ISO-8859-1")

The problem is that I cannot change this to UTF-8, because that would break the rest of our application, which is based on a third-party application that delivers data this way.

Okay, it looks like I've found a way to mess up the encoding on purpose:

new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")

Getting the bytes of the String Künstler in UTF-8 and then creating a new String from them, while telling Java that they are Latin-1, yields KÃ¼nstler. It's a hell of a hack, but it seems to work well.
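The hack above can be wrapped in a small helper so that every search term is mis-decoded the same way as the document before searching. This is only a sketch: the class name `MisdecodedSearch` and the methods `toMisdecoded` and `findAll` are made up for illustration, the BLOB is assumed to contain UTF-8 bytes, and the non-ASCII letters are written as Unicode escapes so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MisdecodedSearch {

    /** Turns a correct term into the form it takes when UTF-8 bytes are read as Latin-1. */
    static String toMisdecoded(String term) {
        return new String(term.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Returns every start index of the term inside the mis-decoded content. */
    static List<Integer> findAll(String misdecodedContent, String term) {
        List<Integer> positions = new ArrayList<>();
        String needle = toMisdecoded(term);
        if (needle.isEmpty()) {
            return positions; // nothing to search for
        }
        int index = misdecodedContent.indexOf(needle);
        while (index >= 0) {
            positions.add(index);
            index = misdecodedContent.indexOf(needle, index + 1);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Simulates the getter: UTF-8 bytes from the BLOB, decoded as ISO-8859-1.
        byte[] blob = "Die Erkenntnis, da\u00DF der K\u00FCnstler Schutz braucht"
                .getBytes(StandardCharsets.UTF_8);
        String content = new String(blob, StandardCharsets.ISO_8859_1);

        System.out.println(findAll(content, "K\u00FCnstler"));
    }
}
```

Note that the returned positions are indices into the mis-decoded String, where every non-ASCII character of the original text occupies two or more chars, so they only line up with other offsets computed on that same String.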

You have already answered this yourself.

An altogether different approach: if you can search the BLOB in the database, you could search using

"SELECT .. FROM ... WHERE"
+ " ... LIKE '%" + key.replaceAll("\\P{ASCII}+", "%") + "%'"

This replaces each run of non-ASCII characters with the % wildcard. It works because every byte of a UTF-8 multibyte sequence is non-ASCII by design, so the wildcard matches whatever the stored text contains in place of the special characters.
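As a rough illustration of the wildcard idea, the pattern can be built in a separate step. The class name `LikePattern` and the method `toLikePattern` are hypothetical. The sketch assumes the key itself contains no `%` or `_` characters, which would otherwise act as wildcards too, and it uses the `\P{ASCII}` spelling of the character class that `java.util.regex.Pattern` documents.

```java
public class LikePattern {

    /** Replaces each run of non-ASCII characters with '%' and wraps the result in '%'. */
    static String toLikePattern(String key) {
        return "%" + key.replaceAll("\\P{ASCII}+", "%") + "%";
    }

    public static void main(String[] args) {
        // The umlaut is written as a Unicode escape to stay independent of file encoding.
        System.out.println(toLikePattern("K\u00FCnstler"));
    }
}
```

The resulting pattern would normally be bound as a query parameter rather than concatenated into the SQL text. Because the wildcard matches any sequence of characters, it can also return false positives, so the hits may need a second, exact check.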
