
How to avoid tripping over UTF-8 BOM when reading files

I'm consuming a data feed that has recently added a Unicode BOM header (U+FEFF), and my rake task is now messed up by it.

I can skip the first 3 bytes with file.gets[3..-1], but is there a more elegant way to read files in Ruby that handles this correctly, whether a BOM is present or not?

With Ruby 1.9.2 you can use the mode r:bom|utf-8:

text_without_bom = nil #define the variable outside the block to keep the data
File.open('file.txt', "r:bom|utf-8"){|file|
  text_without_bom = file.read
}

or

text_without_bom = File.read('file.txt', encoding: 'bom|utf-8')

or

text_without_bom = File.read('file.txt', mode: 'r:bom|utf-8')

It doesn't matter whether the BOM is present in the file or not.
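
To see that claim in action, here is a minimal sketch that reads the same content once with and once without a BOM. The helper name read_with_bom_mode and the temporary file are illustrative assumptions, not part of the original answer.

```ruby
require 'tempfile'

# Hypothetical helper for illustration: writes the given raw bytes to a
# temporary file, then reads the file back using the BOM-aware mode.
def read_with_bom_mode(bytes)
  Tempfile.create('bom_demo') do |tmp|
    tmp.binmode
    tmp.write(bytes)
    tmp.flush
    File.read(tmp.path, mode: 'r:bom|utf-8')
  end
end

bom = "\xEF\xBB\xBF".b
with_bom    = read_with_bom_mode(bom + "hello\n".b)
without_bom = read_with_bom_mode("hello\n".b)

# Both reads should yield the same text, tagged as UTF-8.
p with_bom == without_bom
p with_bom.encoding
```

The same mode string can be passed to File.open as in the examples above.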


You may also use the encoding option with other commands:

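# Pass the encoding as an option; a bare second string argument would be treated as the line separator.
text_without_bom = File.readlines(@filename, encoding: 'bom|utf-8')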

(You get an array with all lines.)

Or with CSV:

require 'csv'
CSV.open(@filename, 'r:bom|utf-8'){|csv|
  csv.each{ |row| p row }
}

I wouldn't blindly skip the first three bytes; what if the producer stops adding the BOM again? What you should do is examine the first few bytes, and if they're 0xEF 0xBB 0xBF, ignore them. That's the form the BOM character (U+FEFF) takes in UTF-8; I prefer to deal with it before trying to decode the stream because BOM handling is so inconsistent from one language/tool/framework to the next.
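
In Ruby, that byte-level check could look roughly like the sketch below. The helper name read_skipping_utf8_bom is an assumption made for illustration, and the sketch assumes the rest of the file really is UTF-8.

```ruby
# Hypothetical helper: returns the file's contents as a UTF-8 string,
# dropping a leading UTF-8 BOM (0xEF 0xBB 0xBF) only if one is present.
def read_skipping_utf8_bom(path)
  File.open(path, 'rb') do |file|
    head = file.read(3)                           # nil when the file is empty
    file.rewind unless head == "\xEF\xBB\xBF".b   # no BOM: start over from byte 0
    # Assumption: the remaining bytes are UTF-8; this only tags them as such.
    file.read.force_encoding(Encoding::UTF_8)
  end
end
```

Because the file is opened in binary mode, the check happens before any decoding, so it behaves the same whether or not the producer keeps sending the BOM.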

In fact, that's how you're supposed to deal with a BOM. If a file has been served as UTF-16, you have to examine the first two bytes before you start decoding so you know whether to read it as big-endian or little-endian. Of course, the UTF-8 BOM has nothing to do with byte order; it's just there to let you know that the encoding is UTF-8, in case you didn't already know that.
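
As a rough sketch of that inspection step, the function below maps the leading bytes of a stream to an encoding. The name encoding_from_bom is hypothetical, and the sketch deliberately ignores UTF-32, whose little-endian BOM (FF FE 00 00) also starts with FF FE.

```ruby
# Hypothetical helper: guesses the encoding from the leading bytes of a stream.
# Returns nil when no known BOM is found.
# Caveat: UTF-32 is not handled here, so a UTF-32LE file would be
# misreported as UTF-16LE.
def encoding_from_bom(head)
  bytes = head.b
  if bytes.start_with?("\xEF\xBB\xBF".b)
    Encoding::UTF_8
  elsif bytes.start_with?("\xFE\xFF".b)
    Encoding::UTF_16BE   # U+FEFF written big-endian
  elsif bytes.start_with?("\xFF\xFE".b)
    Encoding::UTF_16LE   # U+FEFF written little-endian
  end
end
```

A caller would read the first few bytes in binary mode, pass them to this function, and only then choose how to decode the rest.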

I wouldn't "trust" a file to be encoded as UTF-8 just because a BOM of 0xEF 0xBB 0xBF is present; you might fail. Usually, of course, a file that starts with the UTF-8 BOM really is UTF-8 encoded. But if, for example, someone has just prepended the UTF-8 BOM to an ISO-8859 file, decoding it as UTF-8 will fail badly if it contains bytes above 0x7F. You can trust the file if it contains only bytes up to 0x7F, because in that case it is an ASCII file, and every ASCII file is also a valid UTF-8 file.

If the file (after the BOM) contains bytes above 0x7F, then to be sure it is properly UTF-8 encoded you have to check that every multi-byte sequence is valid and, even when all sequences are valid, that each code point uses the shortest possible sequence and that no code point is a high or low surrogate. Also check that no sequence is longer than 4 bytes and that no code point is above 0x10FFFF. That upper limit means the payload bits of a 4-byte sequence's start byte cannot exceed 0x4, and when they equal 0x4 the payload of the first continuation byte cannot exceed 0xF. If all of these checks pass, your UTF-8 BOM tells the truth.
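
In Ruby you do not necessarily have to hand-roll these checks: String#valid_encoding? on a UTF-8 tagged string rejects overlong forms, surrogates and code points above 0x10FFFF. A minimal sketch, with the helper name valid_utf8_after_bom? being an assumption for illustration:

```ruby
# Hypothetical helper: takes the raw bytes of a file and reports whether
# everything after an optional UTF-8 BOM is well-formed UTF-8.
def valid_utf8_after_bom?(raw)
  body = raw.b                                   # binary copy; caller's string is untouched
  body = body.byteslice(3..-1) if body.start_with?("\xEF\xBB\xBF".b)
  body.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Usage sketch: a BOM followed by a lone Latin-1 byte (0xE9) is not valid UTF-8.
p valid_utf8_after_bom?("\xEF\xBB\xBFcaf\xC3\xA9".b)
p valid_utf8_after_bom?("\xEF\xBB\xBFcaf\xE9".b)
```

If the check fails, the BOM is lying and the content needs to be decoded with some other encoding.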


 