Java: Search in a wrongly encoded String without modifying it
I have to find a user-defined String in a document (using Java) that is stored in a database as a BLOB. When I search for a String with special characters (umlauts such as ä, ö, ü), the search fails: it does not return any positions at all. I am not allowed to convert the document's content to UTF-8 (which would fix this problem but create a new, even bigger one).
Some additional information: the document's content is returned as a String decoded as "ISO-8859-1" (Latin1). Here is an example of what such a String can look like:
Die Erkenntnis, daà der Künstler Schutz braucht, ...
This is how it should look:
Die Erkenntnis, daß der Künstler Schutz braucht, ...
If I search for Künstler, nothing is found, because the search looks for ü while the String only contains Ã¼.
Is it possible to convert Künstler into KÃ¼nstler so that I can search for the wrongly encoded version instead?
Note: We are using the Hibernate framework for database access. The original getter for the document's content returns a byte[]. The String is then produced by calling
new String(getContent(), "ISO-8859-1")
The problem is that I cannot change this to UTF-8, because that would break the rest of our application, which is based on a third-party application that delivers data this way.
Okay, looks like I've found a way to mess up the encoding on purpose.
new String("Künstler".getBytes("UTF-8"), "ISO-8859-1")
By getting the bytes of the String Künstler in UTF-8 and then creating a new String while telling Java that those bytes are Latin1, it converts to KÃ¼nstler. It's a hell of a hack, but it seems to work well.
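The hack above can be wrapped in a small helper so that the search returns positions inside the Latin1-decoded String. This is only a sketch: the class name MojibakeSearch and the methods toMojibake and findAll are made up for illustration, and it assumes the BLOB really holds UTF-8 bytes that the getter decodes as ISO-8859-1. Unicode escapes are used so the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original application.
final class MojibakeSearch {

    /** Re-encodes a correct search term the same way the document was mis-decoded. */
    static String toMojibake(String term) {
        // Encode as UTF-8, then deliberately decode those bytes as Latin1.
        return new String(term.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Returns every start index of the term inside the Latin1-decoded content. */
    static List<Integer> findAll(String latin1Content, String term) {
        List<Integer> positions = new ArrayList<>();
        String needle = toMojibake(term);
        if (needle.isEmpty()) {
            return positions;
        }
        for (int i = latin1Content.indexOf(needle); i >= 0; i = latin1Content.indexOf(needle, i + 1)) {
            positions.add(i);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Simulate the getter: UTF-8 bytes from the BLOB decoded as ISO-8859-1.
        byte[] blob = "Die Erkenntnis, dass der K\u00fcnstler Schutz braucht"
                .getBytes(StandardCharsets.UTF_8);
        String content = new String(blob, StandardCharsets.ISO_8859_1);
        System.out.println(findAll(content, "K\u00fcnstler"));
    }
}
```

The positions refer to the mis-decoded String, in which every non-ASCII character occupies two or more chars, so they will not match offsets in a correctly decoded copy of the text.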
Already answered by yourself.
An altogether different approach: if you can search the BLOB in the database, you could search using
"SELECT .. FROM ... WHERE"
+ " ... LIKE '%" + key.replaceAll("\\P{ASCII}+", "%") + "%'"
This replaces every run of non-ASCII characters with the % wildcard. UTF-8 multibyte sequences are non-ASCII by design, so the pattern matches regardless of which bytes the stored text uses for those characters.
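As a rough illustration of the wildcard idea, the pattern can be built separately and then bound as a parameter instead of being concatenated into the SQL text. The class LikePattern and the method toLikePattern are hypothetical names, and the query in the comment uses made-up table and column names; whether LIKE works on the BLOB column at all depends on the database.

```java
// Hypothetical helper, not part of the original application.
final class LikePattern {

    /** Replaces each run of non-ASCII characters with the SQL wildcard %. */
    static String toLikePattern(String key) {
        // \P{ASCII} matches any character outside the 7-bit ASCII range.
        return "%" + key.replaceAll("\\P{ASCII}+", "%") + "%";
    }

    public static void main(String[] args) {
        System.out.println(toLikePattern("K\u00fcnstler"));
        // Assumed usage with a bound parameter (table and column are invented):
        //   PreparedStatement ps = connection.prepareStatement(
        //           "SELECT id FROM documents WHERE content LIKE ?");
        //   ps.setString(1, toLikePattern(key));
    }
}
```

This sketch does not escape % or _ characters that already occur in the key, and because the wildcard matches any sequence, the query can return false positives that still need to be checked in Java.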