Invalid byte 2 of 2-byte UTF-8 sequence: How to find the character

I have a big text file on my Windows machine in UTF-8 encoding. Somehow one or more of the characters in this file are invalid for UTF-8 encoding, giving the error "Invalid byte 2 of 2-byte UTF-8 sequence".

I am using Windows 7, and I want to find the character that is invalid. I guess there is a UNIX command for this, but is there any tool, utility, or regex (something that doesn't require writing a program) that can be used on Windows?

I can use Notepad++, PSPad, or a similar text editor, or, if there is a Windows command, I can create a batch file. Please suggest.

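Open the file with a FileInputStream and read it byte by byte (a FileReader decodes characters, so it cannot give you the raw bytes). If the current byte looks like the first byte of a 2-byte UTF-8 sequence (bit pattern 110xxxxx), read the next byte, put the two in a byte[2] array, and decode it. Note that new String(array, "UTF-8") does not throw on malformed input; it substitutes U+FFFD. So either check the result for that replacement character, or decode with a CharsetDecoder configured with CodingErrorAction.REPORT, which throws a CharacterCodingException. In the loop, count the bytes read so that you can report the position and byte values when decoding fails.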
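A minimal sketch of this idea follows. It is a variation on the approach above rather than a literal transcription. Instead of decoding two bytes at a time, it runs a strict CharsetDecoder over the whole file and reports where decoding first fails. The class name Utf8Scanner and the method firstInvalidOffset are hypothetical names chosen for illustration. The sketch also assumes the file fits in memory, since Files.readAllBytes is limited to about 2 GB.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of any library.
class Utf8Scanner {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if none. */
    static int firstInvalidOffset(byte[] data) {
        // REPORT makes the decoder stop at bad input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // The input position is left at the start of the offending sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // output buffer was full; discard decoded chars and continue
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(data);
        if (offset < 0) {
            System.out.println("No malformed UTF-8 found");
            return;
        }
        // Count newlines before the offset so the spot can be found in an editor.
        int line = 1;
        for (int i = 0; i < offset; i++) {
            if (data[i] == '\n') {
                line++;
            }
        }
        System.out.printf("Malformed UTF-8 at byte offset %d (line %d), first byte 0x%02X%n",
                offset, line, data[offset] & 0xFF);
    }
}
```

Running it as `java Utf8Scanner yourfile.txt` (the file name is a placeholder) prints the offset and line number of the first bad sequence, which you can then jump to in Notepad++ or a hex editor. It stops at the first error. To list every bad sequence you would need to skip past the error and continue decoding.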

It's possible that your UTF-8 file begins with a byte order mark (BOM), which Java Readers often do not recognise.

Open the file in Notepad++. If the file has a BOM, Notepad++ will report "UTF-8" rather than "UTF-8 w/o BOM".

You can either convert the file to UTF-8 without BOM, or use something like https://stackoverflow.com/a/2905038/1554386 to strip the BOM.
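If you would rather check for the BOM in code, a rough sketch could look like the following. The UTF-8 BOM is the three bytes EF BB BF at the very start of the file. The class name Utf8Bom and its methods are hypothetical names used only for illustration, and this is not the code from the linked answer.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
class Utf8Bom {

    /** True if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns a copy without the leading BOM, or the same array if there is none. */
    static byte[] stripBom(byte[] data) {
        return hasBom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

You could call stripBom on the bytes read from the file before passing them to a parser or Reader.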
