
Parsing XLS and XLSX (MS Excel) files with Ruby?

Are there any gems able to parse XLS and XLSX files? I've found Spreadsheet and ParseExcel, but neither of them understands the XLSX format.

I recently needed to parse some Excel files with Ruby. The abundance of libraries and options turned out to be confusing, so I wrote a blog post about it.

Here is a table of different Ruby libraries and what they support:

[Image: table of Ruby libraries and the formats they support]

If you care about performance, here is how the xlsx libraries compare:

[Image: performance comparison of the xlsx libraries]

I have sample code for reading xlsx files with each supported library here.

Here are some examples of reading xlsx files with a few different libraries:

rubyXL

require 'rubyXL'

workbook = RubyXL::Parser.parse './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.sheet_name}"
  num_rows = 0
  worksheet.each do |row|
    # rubyXL can return nil for empty rows and empty cells, so guard both
    row_cells = row ? row.cells.map { |cell| cell && cell.value } : []
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

roo

require 'roo'

workbook = Roo::Spreadsheet.open './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet}"
  num_rows = 0
  workbook.sheet(worksheet).each_row_streaming do |row|
    row_cells = row.map { |cell| cell.value }
    num_rows += 1
  end
  puts "Read #{num_rows} rows" 
end

creek

require 'creek'

workbook = Creek::Book.new './sample_excel_files/xlsx_500_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row.values
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

simple_xlsx_reader

require 'simple_xlsx_reader'

workbook = SimpleXlsxReader.open './sample_excel_files/xlsx_500000_rows.xlsx'
worksheets = workbook.sheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    row_cells = row
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

Here is an example of reading a legacy xls file using the spreadsheet library:

spreadsheet

require 'spreadsheet'

# Note: spreadsheet only supports .xls files (not .xlsx)
workbook = Spreadsheet.open './sample_excel_files/xls_500_rows.xls'
worksheets = workbook.worksheets
puts "Found #{worksheets.count} worksheets"

worksheets.each do |worksheet|
  puts "Reading: #{worksheet.name}"
  num_rows = 0
  worksheet.rows.each do |row|
    # unwrap cells that expose a value (such as formulas); leave plain values alone
    row_cells = row.to_a.map { |v| v.respond_to?(:value) ? v.value : v }
    num_rows += 1
  end
  puts "Read #{num_rows} rows"
end

I just found that roo might do the job. It works for my requirements, which are reading a basic spreadsheet.

The roo gem works great for Excel (.xls and .xlsx), and it's being actively developed.

I agree that the syntax is neither great nor Ruby-like. But that can easily be fixed with something like:

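require 'roo'

# Note: this class name clashes with the spreadsheet gem's Spreadsheet module,
# so rename it if both are loaded.
class Spreadsheet
  def initialize(file_path)
    @xls = Roo::Spreadsheet.open(file_path)
  end

  # Yields each sheet name after making that sheet the default one
  def each_sheet
    @xls.sheets.each do |sheet|
      @xls.default_sheet = sheet
      yield sheet
    end
  end

  # roo numbers rows and columns from 1, so start at first_row / first_column
  def each_row
    @xls.first_row.upto(@xls.last_row) do |index|
      yield @xls.row(index)
    end
  end

  def each_column
    @xls.first_column.upto(@xls.last_column) do |index|
      yield @xls.column(index)
    end
  end
end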

I'm using creek, which uses nokogiri. It is fast: it took 8.3 seconds on a 21x11250 xlsx table on my MacBook Air. I got it to work on Ruby 1.9.3+. The output format for each row is a hash of cell reference (column letter plus row number) to cell content:

{"A1"=>"a cell", "B1"=>"another cell"}

The hash makes no guarantee that the keys will be in the original column order.

https://github.com/pythonicrubyist/creek
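Since the row hash is keyed by cell reference, the column order can be recovered from the keys themselves. Here is a minimal sketch in plain Ruby, with no creek calls involved; `ordered_row_values` is a hypothetical helper name, and the sample row is made up to match the format described above:

```ruby
# Hypothetical helper: turn a creek-style row hash into an array ordered by column.
# Column letters sort correctly when compared by length first, then alphabetically
# ("Z" comes before "AA", which comes before "AB").
def ordered_row_values(row)
  row.sort_by { |ref, _value| col = ref[/\A[A-Z]+/]; [col.length, col] }
     .map { |_ref, value| value }
end

# Made-up row in the format described above, deliberately out of order
row = { "B1" => "another cell", "AA1" => "far right", "A1" => "a cell" }
p ordered_row_values(row)
```

Sorting by length first matters once a sheet has more than 26 columns, because a plain string sort would put "AA" before "B".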

dullard is another great one that uses nokogiri. It is super fast: it took 6.7 seconds on a 21x11250 xlsx table on my MacBook Air. I got it to work on Ruby 2.0.0+. The output format for each row is an array:

["a cell", "another cell"]

https://github.com/thirtyseven/dullard

simple_xlsx_reader, which has already been mentioned, is great but a bit slow: it took 91 seconds on a 21x11250 xlsx table on my MacBook Air. I got it to work on Ruby 1.9.3+. The output format for each row is an array:

["a cell", "another cell"]

https://github.com/woahdae/simple_xlsx_reader

Another interesting one is oxcelix. It uses ox's SAX parser, which is supposedly faster than both nokogiri's DOM and SAX parsers. It supposedly outputs a Matrix. I could not get it to work, and there were some dependency issues with rubyzip. I would not recommend it.

In conclusion, creek seems like a good choice. Other posts recommend simple_xlsx_reader, as it has similar performance.

I removed dullard from my recommendations, as it's outdated and people are getting errors or having other problems with it.
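Timings like the ones quoted above can be collected with Ruby's standard Benchmark module. This is only a sketch: `time_parse` is a hypothetical helper, and the block below is a stand-in workload that you would replace with a real call into the library you are measuring.

```ruby
require 'benchmark'

# Hypothetical helper: time any block and return the elapsed seconds.
def time_parse(label)
  seconds = Benchmark.realtime { yield }
  puts format("%-20s %.2f s", label, seconds)
  seconds
end

# Stand-in workload; replace the block body with a real parser call
elapsed = time_parse("dummy workload") { 100_000.times.map { |i| i.to_s } }
```

Running each library against the same file, several times, gives more trustworthy numbers than a single run.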

If you're looking for more modern libraries, take a look at Spreadsheet: http://spreadsheet.rubyforge.org/GUIDE_txt.html. I can't tell if it supports XLSX files, but considering that it is actively developed, I'm guessing it does (I'm not on Windows, and I don't have Office, so I can't test).

At this point, it looks like roo is a good option again. It supports XLSX and allows (some) iteration by just using times with cell access. I admit it's not pretty, though.

Also, RubyXL can now give you a sort of iteration using its extract_data method, which gives you a 2D array of data that can be easily iterated over.
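Once the data is a plain 2D array, ordinary Ruby is enough to work with it. A minimal sketch follows, using a hard-coded array as a stand-in for what extract_data is described as returning. The sample values, the nil entry for a blank row, and the assumption that the first row is a header are all mine:

```ruby
# Stand-in for the 2D array described above; no rubyXL calls are made here.
data = [
  ["name", "qty"],
  ["apples", 3],
  nil,            # assumption: a blank row may come back as nil
  ["pears", 5]
]

# Treat the first row as the header and turn every other row into a hash
header, *rows = data.compact
records = rows.map { |row| header.zip(row).to_h }

records.each { |record| puts record.inspect }
```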

Alternatively, if you're trying to work with XLSX files on Windows, you can use Ruby's Win32OLE library, which allows you to interface with OLE objects such as the ones provided by Word and Excel. However, as @PanagiotisKanavos mentioned in the comments, this has a few major drawbacks:

  • Excel must be installed
  • A new Excel instance is started for each document
  • Memory and other resource consumption is far more than what is necessary for simple XLSX document manipulation.

But if you choose to use it, you can choose not to display Excel, load your XLSX file, and access its contents through the OLE API. I'm not sure if it supports iteration; however, I don't think it would be too hard to build around the supplied methods, as it is the full Microsoft OLE API for Excel.

Here's the documentation: http://support.microsoft.com/kb/222101
Here's the library: http://www.ruby-doc.org/stdlib-1.9.3/libdoc/win32ole/rdoc/WIN32OLE.html

Again, the options don't look much better, but I'm afraid there isn't much else out there. It's hard to parse a file format that is a black box, and the few who managed to crack it didn't do so very visibly. Google Docs is closed source, and LibreOffice is thousands of lines of hairy C++.

I've been working heavily with both Spreadsheet and rubyXL these past couple of weeks, and I must say that both are great tools. However, one area where both suffer is the lack of examples of actually implementing anything useful. Currently I'm building a crawler, using rubyXL to parse xlsx files and Spreadsheet for anything xls. I hope the code below can serve as a helpful example and show just how effective these tools can be.

require 'find'
require 'rubyXL'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\.xlsx\z/                                    # check if file is xlsx format (escaped dot, anchored at end of path)
    workbook = RubyXL::Parser.parse(file).worksheets       # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      data = worksheet.extract_data.to_s                   # extract data of a given worksheet - must be converted to a string in order to match a regex
      if data =~ /regex/
        puts file
        count += 1
      end      
    end
  end
end

puts "#{count} files were found"

require 'find'
require 'spreadsheet'
Spreadsheet.client_encoding = 'UTF-8'

count = 0

Find.find('/Users/Anconia/crawler/') do |file|             # begin iteration of each file of a specified directory
  if file =~ /\.xls\z/                                     # check if a given file is xls format (escaped dot, anchored at end of path)
    workbook =  Spreadsheet.open(file).worksheets          # creates an object containing all worksheets of an excel workbook
    workbook.each do |worksheet|                           # begin iteration over each worksheet
      worksheet.each do |row|                              # begin iteration over each row of a worksheet
        if row.to_s =~ /regex/                             # rows must be converted to strings in order to match the regex
          puts file
          count += 1
        end
      end
    end
  end
end

puts "#{count} files were found"
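The two scripts above pick a library by file extension. As a minimal sketch of doing that in one place, the lookup below uses File.extname from Ruby's core library instead of a regex. `PARSER_BY_EXTENSION` and `parser_for` are hypothetical names, and the symbols are just labels for the two gems, not calls into them:

```ruby
# Hypothetical lookup table: which library to use for which extension
PARSER_BY_EXTENSION = {
  ".xlsx" => :rubyxl,
  ".xls"  => :spreadsheet
}.freeze

# Returns the parser label for a path, or nil for unsupported files.
# downcase makes the check case-insensitive (REPORT.XLSX also matches).
def parser_for(path)
  PARSER_BY_EXTENSION[File.extname(path).downcase]
end

%w[crawler/report.xlsx crawler/legacy.XLS crawler/notes.txt].each do |path|
  puts "#{path}: #{parser_for(path).inspect}"
end
```

A single crawl loop could then branch on the returned label rather than walking the directory twice.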

The rubyXL gem parses XLSX files very well.

I couldn't find a satisfactory xlsx parser. RubyXL doesn't do date typecasting, Roo tried to typecast a number as a date, and both are a mess in both API and code.

So, I wrote simple_xlsx_reader. You'd have to use something else for xls, though, so maybe it's not the full answer you're looking for.
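For context on why date typecasting is easy to get wrong: XLSX stores a date as a plain serial number, and only the cell's number format marks it as a date, so a parser has to decide when to convert. The conversion itself is simple. Here is a minimal sketch, assuming the default 1900 date system and serials of 61 or more (earlier values are affected by Excel's fictitious 1900-02-29); `excel_serial_to_date` is a hypothetical helper, not part of any of the gems discussed:

```ruby
require 'date'

# Day zero commonly used to convert serials in Excel's default 1900 date system
EXCEL_EPOCH = Date.new(1899, 12, 30)

# Hypothetical helper: the integer part of the serial is the day count,
# the fractional part (ignored here) is the time of day.
def excel_serial_to_date(serial)
  EXCEL_EPOCH + serial.floor
end

puts excel_serial_to_date(42005)   # prints 2015-01-01
```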

Most of the online examples, including the author's website for the Spreadsheet gem, demonstrate reading the entire contents of an Excel file into RAM. That's fine if your spreadsheet is small.

xls = Spreadsheet.open(file_path)

For anyone working with very large files, a better way is to stream-read the contents of the file. The Spreadsheet gem supports this, although it is not well documented at this time (as of March 2015).

# each yields one row at a time; rows does not take a block
Spreadsheet.open(file_path).worksheets.first.each do |row|
  # do something with the row, an array of cell values
end

CITE: https://github.com/zdavatz/spreadsheet

The RemoteTable library uses roo internally. It makes it easy to read spreadsheets in different formats (XLS, XLSX, CSV, etc.), which may be remote or stored inside a zip, gz, or similar archive:

require 'remote_table'
r = RemoteTable.new 'http://www.fueleconomy.gov/FEG/epadata/02data.zip', :filename => 'guide_jan28.xls'
r.each do |row|
  puts row.inspect
end

Output:

{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.0", "cyl"=>"6.0", "trans"=>"Auto(S4)", "drv"=>"R", "bidx"=>"60.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"20.0", "ucty"=>"19.1342", "uhwy"=>"30.2", "ucmb"=>"22.9121", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1238.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"2MODE", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ACURA", "carline name"=>"NSX", "displ"=>"3.2", "cyl"=>"6.0", "trans"=>"Manual(M6)", "drv"=>"R", "bidx"=>"65.0", "cty"=>"17.0", "hwy"=>"24.0", "cmb"=>"19.0", "ucty"=>"18.7", "uhwy"=>"30.4", "ucmb"=>"22.6171", "fl"=>"P", "G"=>"", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1302.0", "eng dscr"=>"DOHC-VTEC", "trans dscr"=>"", "vpc"=>"4.0", "cls"=>"1.0"}
{"Class"=>"TWO SEATERS", "Manufacturer"=>"ASTON MARTIN", "carline name"=>"ASTON MARTIN VANQUISH", "displ"=>"5.9", "cyl"=>"12.0", "trans"=>"Auto(S6)", "drv"=>"R", "bidx"=>"1.0", "cty"=>"12.0", "hwy"=>"19.0", "cmb"=>"14.0", "ucty"=>"13.55", "uhwy"=>"24.7", "ucmb"=>"17.015", "fl"=>"P", "G"=>"G", "T"=>"", "S"=>"", "2pv"=>"", "2lv"=>"", "4pv"=>"", "4lv"=>"", "hpv"=>"", "hlv"=>"", "fcost"=>"1651.0", "eng dscr"=>"GUZZLER", "trans dscr"=>"CLKUP", "vpc"=>"4.0", "cls"=>"1.0"}
