
How to read a file with a UTF-8 BOM using File.foreach in Ruby

I would like to read a text file with a UTF-8 BOM using Ruby's File.foreach. The BOM is inserted in the file by writing it as the first line: myFile.write "\" . Download myFile.txt from this link:

https://wetransfer.com/downloads/b42363faaff561e7e3ca2edbe850d88d20190807164816/c6e4e1

I tried simply reading the file like this:

  File.foreach(myFile).with_index do |line, line_num|
    puts "line = " + line
    puts "line.bytes = " + line.bytes.to_s()
    puts "line.bytes.map(&:chr) = " + line.bytes.map(&:chr).to_s()
  end

The problem is that line appears to be empty for every line of the file. However, I can see that there is something there when I use bytes. I also tried the encoding argument of File.foreach, as follows:

  File.foreach(myFile, :encoding=> 'r:bom|utf-8').with_index do |line,line_num|
    puts "line = " + line
    puts "line.bytes = " + line.bytes.to_s()
    puts "line.bytes.map(&:chr) = " + line.bytes.map(&:chr).to_s()
  end

But I am getting the same results. In both cases it seems that Ruby recognises the UTF-8 BOM, because puts line.encoding prints 'utf-8'. But I cannot access the characters of the line string as usual. For example, using myFile.txt, the condition below is never triggered.

   if line[0,5] == 'Hello'
     puts "Hello catched"
   end  

Do you know how I can read my files using File.foreach?

Regards

The problem seems to be that you have a file with mixed encodings, and we can't be sure of its source. So working out how to "read" this data is not as easy as just converting it to some other encoding. However, you might try this to see what's going on:

File.read('myFile.txt').encode("Windows-1252", invalid: :replace, undef: :replace)
=> "?Hello\nI may contain UTF-8 characters as D\xF8RBLAD\n"

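To go one step further than the transcoding call above, it can help to look at the raw bytes before Ruby applies any encoding. The following is only a sketch: inspect_encoding is a hypothetical helper name, and it assumes the file is small enough to read into memory at once. It reports whether the file starts with the UTF-8 BOM bytes (EF BB BF) and whether the remaining bytes form valid UTF-8.

```ruby
# Hypothetical helper (not from the question): summarises the raw bytes of a file.
def inspect_encoding(path)
  raw = File.binread(path)          # raw bytes, tagged ASCII-8BIT
  utf8_bom = "\xEF\xBB\xBF".b       # the three bytes of a UTF-8 BOM
  has_bom = raw.start_with?(utf8_bom)

  # Drop the BOM (if any) before checking the rest of the content.
  body = has_bom ? raw.byteslice(3, raw.bytesize) : raw

  {
    utf8_bom: has_bom,
    valid_utf8: body.dup.force_encoding(Encoding::UTF_8).valid_encoding?,
    first_bytes: raw.bytes.first(8)
  }
end

# Usage, assuming myFile.txt sits in the current directory.
p inspect_encoding('myFile.txt') if File.exist?('myFile.txt')
```

If valid_utf8 comes back false, the file is probably not pure UTF-8, which would fit the mixed-encoding suspicion.
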
This may not be a complete answer, but you may want to refer to this article, which covers ideas about how to go about solving your issue.
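One more thing that may be worth checking, separate from the mixed-encoding question: as far as I know, the encoding: option of File.foreach expects an encoding name such as 'BOM|UTF-8', whereas the r: prefix used in the question belongs to mode strings passed to File.open. Below is a minimal sketch, assuming the file really is UTF-8 with a leading BOM. lines_without_bom is a hypothetical helper name, and the sample file is made up for illustration.

```ruby
require 'tempfile'

# Hypothetical helper: returns the lines of a UTF-8 file, skipping a leading BOM.
def lines_without_bom(path)
  # 'BOM|UTF-8' asks Ruby to consume a UTF-8 byte order mark if one is present.
  File.foreach(path, encoding: 'BOM|UTF-8').map(&:chomp)
end

# Made-up sample: a temporary file that starts with U+FEFF encoded as UTF-8.
Tempfile.create(['bom_sample', '.txt']) do |file|
  File.binwrite(file.path, "\uFEFFHello\nsecond line\n")

  lines = lines_without_bom(file.path)
  puts 'Hello caught' if lines.first[0, 5] == 'Hello'
end
```

If the comparison still fails on the real myFile.txt, the file likely contains more than a BOM followed by UTF-8 text, and the byte-level inspection is the better starting point.
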
