Invalid byte 2 of 2-byte UTF-8 sequence: How to find the character
I have a big text file on my Windows machine in UTF-8 encoding. Somehow one or more of the characters in this file are invalid for UTF-8, which produces the error "Invalid byte 2 of 2-byte UTF-8 sequence".
I am using Windows 7, and I want to find the invalid character. I guess there is a UNIX command for this, but is there any tool, utility, or regex (something that doesn't require writing a program) that can be used on Windows?
I can use Notepad++, PSPad, or a similar text editor, or if there is a Windows command, I can create a batch file. Please suggest.
Create a FileInputStream to read the file byte by byte (a FileReader decodes to characters, so it does not expose the raw bytes). If the current byte looks like the first byte of a 2-byte UTF-8 sequence (bit pattern 110xxxxx), read the next byte, put the two in a byte[2] array, and decode them strictly. Note that new String(array, "UTF-8") replaces malformed input with U+FFFD rather than throwing, so either check the result for that replacement character or use a CharsetDecoder configured with CodingErrorAction.REPORT. In the loop, count the bytes read, and catch the exception to report the position and byte values.
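The approach above can be sketched in Java roughly as follows. This is a minimal sketch under a few assumptions, not a definitive tool: the class name Utf8Check and the method firstInvalidOffset are made up for illustration, the whole file is assumed to fit in memory, and a strict CharsetDecoder checks every sequence (not only 2-byte ones) instead of decoding byte pairs by hand.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, named here only for illustration.
class Utf8Check {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if the data is valid. */
    static int firstInvalidOffset(byte[] data) {
        // REPORT makes the decoder signal malformed input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096); // scratch space; decoded text is discarded

        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input buffer is positioned at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: empty the scratch buffer and keep decoding
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java Utf8Check <file>");
            return;
        }
        // Assumption: the file is small enough to load into memory at once.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(data);
        if (offset < 0) {
            System.out.println("No invalid UTF-8 sequence found.");
        } else {
            System.out.printf("Invalid UTF-8 at byte offset %d (byte value 0x%02X)%n",
                    offset, data[offset] & 0xFF);
        }
    }
}
```

Run with the file path as the only argument, this prints the byte offset of the first problem. A hex editor, or a text editor that can jump to a byte offset, can then be used to inspect that position.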
It's possible that your UTF-8 file has a Byte Order Mark (BOM) at the start, which Java Readers often do not recognise.
Open the file in Notepad++. If the file has a BOM, Notepad++ will report "UTF-8" rather than "UTF-8 w/o BOM".
You can either convert to UTF-8 without BOM or use something like https://stackoverflow.com/a/2905038/1554386 to strip the BOM.
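If you would rather strip the BOM yourself, a minimal sketch might look like the following. The class and method names (BomStripper, hasUtf8Bom, stripUtf8Bom) are made up for illustration, and this is not necessarily what the linked answer does. It relies only on the fact that the UTF-8 BOM is the three bytes EF BB BF at the very start of the file.

```java
import java.util.Arrays;

// Hypothetical helper class, named here only for illustration.
class BomStripper {

    /** True if the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns the data without a leading UTF-8 BOM; the input is returned unchanged if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

The stripped bytes can then be handed to whatever reader or parser was rejecting the file.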