
Invalid byte 2 of 2-byte UTF-8 sequence: How to find the character

Original post: 2015-04-09 17:20:29 · Tags: java, regex, utf-8

I have a big text file on my Windows machine that is supposed to be UTF-8 encoded. Somehow, one or more byte sequences in this file are not valid UTF-8, and reading it fails with the error "Invalid byte 2 of 2-byte UTF-8 sequence".

I am using Windows 7, and I want to find the invalid character. I guess there is a UNIX command for this, but is there any tool, utility, or regex (something that doesn't require writing a program) that can be used on Windows?

I can use Notepad++, PSPad, or a similar text editor, or, if there is a Windows command for this, I can create a batch file. Please suggest.

2 Answers

Create a FileInputStream to read the file byte by byte (a FileReader decodes characters rather than returning the raw bytes). If the current byte looks like the first byte of a 2-byte UTF-8 sequence, read the next byte, put the two in a byte[2] array, and pass that to new String(array, "UTF-8"). In the loop, count the bytes read so that you can report the position and byte values where decoding fails. Note that new String(...) replaces malformed input with U+FFFD instead of throwing, so either check the result for U+FFFD or decode with a CharsetDecoder configured to report errors.
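As a rough sketch of the scanning idea above, the following uses a strict `CharsetDecoder` rather than checking each two-byte pair by hand, so it reports any malformed UTF-8 sequence, not only 2-byte ones. The class name `Utf8Scanner` and the method `firstInvalidOffset` are made up for this illustration, and reading the whole file into memory is an assumption that may not suit a very large file.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of any library.
final class Utf8Scanner {

    /** Returns the byte offset of the first malformed UTF-8 sequence, or -1 if none is found. */
    static int firstInvalidOffset(byte[] data) {
        // REPORT makes the decoder stop at bad input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(4096);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position is left at the start of the bad sequence.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without error
            }
            out.clear(); // overflow: discard the decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the file fits in memory.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        int offset = firstInvalidOffset(data);
        if (offset < 0) {
            System.out.println("No malformed UTF-8 found.");
            return;
        }
        StringBuilder hex = new StringBuilder();
        for (int i = offset; i < Math.min(data.length, offset + 2); i++) {
            hex.append(String.format("%02X ", data[i] & 0xFF));
        }
        System.out.println("Malformed sequence at byte offset " + offset + ": " + hex.toString().trim());
    }
}
```

Run it with the file path as the only argument. The reported offset is zero-based and can be looked up in a hex editor.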

It's possible that your UTF-8 file has a Byte Order Mark (BOM) at the start, which is often not recognised by the Java Readers.

Open the file in Notepad++. If the file has a BOM, Notepad++ will report "UTF-8" rather than "UTF-8 w/o BOM".

You can either convert the file to UTF-8 without BOM, or use something like https://stackoverflow.com/a/2905038/1554386 to strip the BOM.
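For illustration only, here is a minimal sketch of checking for and removing a UTF-8 BOM (the bytes EF BB BF) from data already loaded into a byte array. This is not the code from the linked answer, and the names `BomStripper`, `hasUtf8Bom`, and `stripUtf8Bom` are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper class, not part of any library.
final class BomStripper {

    /** True if the data starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Returns a copy without the leading BOM, or the original array if there is none. */
    static byte[] stripUtf8Bom(byte[] data) {
        return hasUtf8Bom(data) ? Arrays.copyOfRange(data, 3, data.length) : data;
    }
}
```

The stripped bytes can then be handed to the parser or reader, for example by wrapping them in a `java.io.ByteArrayInputStream`.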

Related questions

invalid byte 2 of 2-byte UTF-8 sequence

MalformedByteSequenceException: Invalid byte 2 of 2-byte UTF-8 sequence

Android studio Invalid byte 2 of 2-byte UTF-8 sequence

JAXB & UTF-8 Unmarshal exception “Invalid byte 2 of 2-byte UTF-8 sequence”

Invalid byte 2 of 2-byte UTF-8 Java, sequence error depending on Windows/IntelliJ

Parse RSS from URLs gives me “Invalid byte 2 of 2-byte UTF-8 sequence”

Selenium Web Driver : MalformedByteSequenceException Invalid byte 2 of 2-byte UTF-8 sequence

nested exception is org.xml.sax.SAXParseException: Invalid byte 2 of 2-byte UTF-8 sequence

Invalid byte 2 of 2-byte UTF-8 sequence: XML saved as String varible

Reading xml -file in UTF-8 format in Windows with Java gives “IOException: Invalid byte 2 of 2-byte UTF-8 sequence.” -error

