
Convert non-English string to normal String in Java

I am required to validate certain text against some baselines.

For example:

String a="La PanthÃ¨re";
String b="La Panth&egrave;re";

I know that string b contains HTML literals, so I am using Apache StringEscapeUtils, which gives me:

String b="La Panth&egrave;re";
b=StringEscapeUtils.unescapeHtml(b);

Output: La Panthère

However, I do not know what is stored in string a. Somewhere on SO I read that these might be accent literals, and hence tried the code below:

a=Normalizer.normalize(a, Normalizer.Form.NFKD);

Note: I tried all forms of Normalizer, but nothing worked.

Can someone please help me make String a look the same as b?

As Jesper mentions, the Ã¨ pattern typically indicates a mis-encoding: the two UTF-8 bytes of è were each read as a separate single-byte character.

At that point, you're already out of luck.

Remedial actions such as replacing the Ã¨ are neither advisable nor safe.

Escaping or normalizing the String is out of scope, as your problem is at the source and has nothing to do with HTML conversion or accent normalization.
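To illustrate why normalization cannot help, here is a minimal sketch that runs every Normalizer form over the garbled text and checks whether any of them yields the expected string. The class and method names are hypothetical, and the literals use Unicode escapes (U+00C3 U+00A8 for the garbled pair, U+00E8 for è) so the source file's own encoding cannot interfere.

```java
import java.text.Normalizer;

public class NormalizerCheck {
    // Garbled text: A-with-tilde (U+00C3) followed by a diaeresis (U+00A8),
    // i.e. the two UTF-8 bytes of U+00E8 read as two single-byte characters.
    static final String GARBLED = "La Panth\u00c3\u00a8re";
    // The text we would like to get back: e-with-grave is U+00E8.
    static final String EXPECTED = "La Panth\u00e8re";

    /** Returns true if any Unicode normalization form turns GARBLED into EXPECTED. */
    static boolean anyFormRepairs() {
        for (Normalizer.Form form : Normalizer.Form.values()) {
            if (Normalizer.normalize(GARBLED, form).equals(EXPECTED)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(anyFormRepairs()); // prints "false"
    }
}
```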

However, there are simple idioms to convert the String to a different encoding.

The example below:

  • simulates a String wrongly decoded as Windows-1252 (in a UTF-8 environment);
  • then prints it as is (corrupted, since it's a Windows-1252 String in a UTF-8 print stream);
  • finally, prints it re-converted to UTF-8.

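     // Requires: import java.nio.charset.Charset;
     // Simulate the corruption: UTF-8 bytes misread as Windows-1252
     String a = new String(
         "La Panthère".getBytes(Charset.forName("UTF-8")),
         Charset.forName("Cp1252")
     );
     System.out.println(a);
     // Reverse it: recover the original bytes, then decode them as UTF-8
     System.out.println(
         new String(
             a.getBytes(Charset.forName("Cp1252")),
             Charset.forName("UTF-8")
         )
     );

Output

La PanthÃ¨re
La Panthère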

Notes

The conversion idiom described above implies you know beforehand how the original String is encoded.
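When the source encoding is only suspected rather than known, one option is to make the round trip strict, so that it fails instead of silently producing more garbage. The sketch below is not part of the original answer: `MojibakeRepair.repair` is a hypothetical helper, and it assumes the damage was specifically "UTF-8 bytes decoded as Windows-1252". It returns the input unchanged whenever the text cannot be mapped back to Windows-1252 bytes or those bytes are not valid UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    /**
     * Hypothetical helper: tries to undo "UTF-8 read as Windows-1252".
     * Returns the input unchanged if the strict round trip fails.
     */
    static String repair(String s) {
        try {
            // Step 1: recover the raw bytes, failing on characters
            // that do not exist in Windows-1252.
            ByteBuffer bytes = Charset.forName("windows-1252").newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(CharBuffer.wrap(s));
            // Step 2: decode those bytes as UTF-8, failing on invalid sequences.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(bytes)
                    .toString();
        } catch (CharacterCodingException e) {
            // Not a clean round trip: leave the text alone.
            return s;
        }
    }

    public static void main(String[] args) {
        // Garbled input written with Unicode escapes (U+00C3 U+00A8).
        System.out.println(repair("La Panth\u00c3\u00a8re"));
    }
}
```

A correctly encoded "La Panthère" passes through untouched, because a lone 0xE8 byte followed by an ASCII letter is not valid UTF-8. This remains a heuristic: text that legitimately contains such character pairs would be altered, and bytes lost during the original mis-decoding cannot be restored.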

Typical encoding issues take place when text written in one of the following encodings is interpreted as another:

  • ISO Latin 1
  • Windows-1252
  • UTF-8

The Java documentation provides a list of supported encodings along with their canonical names.
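As a rough illustration of why these three collide, the sketch below prints the bytes that è (U+00E8) occupies in each encoding, then decodes the UTF-8 bytes as ISO Latin 1. The `EncodingBytes` class and its `hex` helper are hypothetical names introduced for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    /** Formats the bytes of s in the given charset as space-separated hex. */
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Decodes the UTF-8 bytes of s as if they were ISO Latin 1. */
    static String utf8ReadAsLatin1(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String eGrave = "\u00e8"; // e-with-grave
        System.out.println(hex(eGrave, StandardCharsets.ISO_8859_1));       // E8
        System.out.println(hex(eGrave, Charset.forName("windows-1252")));   // E8
        System.out.println(hex(eGrave, StandardCharsets.UTF_8));            // C3 A8
        // Two bytes become two characters: U+00C3 followed by U+00A8.
        System.out.println(utf8ReadAsLatin1(eGrave).length());              // 2
    }
}
```

The single-byte encodings agree on this character, while UTF-8 needs two bytes; reading those two bytes with a single-byte decoder is what produces the two-character garble.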
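The canonical names can also be queried at runtime. The following is a small sketch, with the hypothetical helper `canonicalName`, showing that the legacy alias `Cp1252` used in the example above resolves to the same charset as `windows-1252`.

```java
import java.nio.charset.Charset;

public class CharsetNames {
    /** Resolves an encoding name or alias to its canonical java.nio name. */
    static String canonicalName(String nameOrAlias) {
        return Charset.forName(nameOrAlias).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("Cp1252"));    // windows-1252
        System.out.println(canonicalName("ISO8859_1")); // ISO-8859-1
        System.out.println(canonicalName("UTF8"));      // UTF-8
        // Every charset available in the running JVM, keyed by canonical name.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}
```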

In a web context, you'd typically invoke JavaScript's encodeURIComponent function to encode your values in the front-end before sending them to the back-end.
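On the Java side, a value encoded that way can be decoded with an explicit UTF-8 charset, so the result does not depend on platform defaults. This is a minimal sketch assuming the front-end sent the output of encodeURIComponent("La Panthère"), which is `La%20Panth%C3%A8re`; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class ComponentDecoding {
    /** Decodes a percent-encoded value, always interpreting the bytes as UTF-8. */
    static String decodeComponent(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String decoded = decodeComponent("La%20Panth%C3%A8re");
        // Compare against the expected text, written with a Unicode escape.
        System.out.println(decoded.equals("La Panth\u00e8re")); // prints "true"
    }
}
```

Note that URLDecoder also turns a literal `+` into a space. encodeURIComponent escapes `+` as `%2B`, so this should not matter for values produced by it, but it can matter for input from other sources.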
