
Ruby sort order of array of hash using another array in an efficient way so processing time is constant

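I have some data that I need to export as CSV. It currently holds about 10,000 records and will keep growing, so I want an efficient way to iterate over it, especially since I run several each loops one after another. My question is whether there is a way to avoid the many each loops I describe below and, if not, whether there is something other than Ruby's each/map that I can use to keep the processing time constant irrespective of data size.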

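For example:

  1. First I loop through the whole data to flatten and rename the fields that hold array values, so that a field like issue becomes issue_1 and issue_2 if its array contains only two items.

  2. Next I do another loop to get all the unique keys in the array of hashes.

  3. Using the unique keys from step 2, I do another loop to sort those keys using a different array that holds the order in which the keys should be arranged.

  4. Finally, another loop generates the CSV.

So I iterate over the data 4 times, each time with Ruby's each/map, and the time to complete these loops increases as the data size grows.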
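Every record has to be read at least once, so the run time grows linearly with the data and cannot be constant; the number of passes only changes the constant factor. Steps 1 and 2 can still share a single pass. The following is a minimal sketch of that idea, not the poster's code: `flatten_and_collect` is a hypothetical helper name, and it assumes that array values become `key_1`, `key_2`, ... while scalar values keep their key unchanged.

```ruby
require 'set'

# Hypothetical helper: flattens each row and gathers the unique keys in one pass.
# Assumption: array values become key_1, key_2, ...; scalars keep their key as-is.
def flatten_and_collect(rows)
  keys = Set.new
  flat_rows = rows.map do |row|
    row.each_with_object({}) do |(key, value), flat|
      if value.is_a?(Array)
        value.each_with_index do |item, i|
          name = "#{key}_#{i + 1}"
          flat[name] = item
          keys << name
        end
      else
        flat[key] = value
        keys << key
      end
    end
  end
  [flat_rows, keys.to_a]
end

# Made-up sample rows in the shape of the question's data
rows = [
  { "id" => "a1", "issue" => ["nov"], "creator" => ["yes", "some"] },
  { "id" => "b2", "issue" => ["dec", "ten"], "creator" => ["tako"] }
]
flat_rows, keys = flatten_and_collect(rows)
```

Here `keys` lists every flattened key in first-seen order, so the separate key-collection loop from step 2 is no longer needed.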

The original data is in the following format:

def data
  [
     {"file"=> ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified" => "2019-12-24", "book_title_1"=>"", "title"=> ["haha"], "edition"=> [""], "issue" => ["nov"], "creator" => ["yes", "some"], "publisher"=> ["Library"], "place_of_publication" => "London, UK"},

    {"file" => ["getty_883231284_200013331818843182490_335833.jpg"], "id" => "60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded" => "2019-12-24", "date_modified"=>"2019-12-24", "book_title"=> [""], "title" => ["try"], "edition"=> [""], "issue"=> ["dec", 'ten'], "creator"=> ["tako", "bell", 'big mac'], "publisher"=> ["Library"], "place_of_publication" => "NY, USA"}]
end

The data after remapping, with the arrays flattened and the keys that hold those arrays renamed:

def csv_data
  @csv_data = [
     {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"haha", "edition_1"=>"", "issue_1"=>"nov", "creator_1"=>"yes", "creator_2"=>"some", "publisher_1"=>"Library", "place_of_publication_1"=>"London, UK"},

    {"file_1"=>"getty_883231284_200013331818843182490_335833.jpg", "id"=>"60706a8e-882c-45d8-ad5d-ae898b98535f", "date_uploaded"=>"2019-12-24", "date_modified"=>"2019-12-24", "book_title_1"=>"", "title_1"=>"try", "edition_1"=>"", "issue_1"=>"dec", "issue_2" => 'ten', "creator_1"=>"tako", "creator_2"=>"bell", 'creator_3' => 'big mac', "publisher_1"=>"Library", "place_of_publication_1"=>"NY, USA"}]

end

Sorting the headers for the data above:

def csv_header(rows = csv_data)

  csv_order = ["id", "edition_1", "date_uploaded",  "creator_1", "creator_2", "creator_3", "book_title_1", "publisher_1", "file_1", "place_of_publication_1", "journal_title_1", "issue_1", "issue_2", "date_modified"]

  # Collect every unique key that appears in any row
  all_keys = rows.flat_map(&:keys).uniq.compact

  #resort using ordering by suffix eg creator_isni_1 comes before creator_isni_2
  all_keys = all_keys.sort_by{ |name| [name[/\d+/].to_i, name] }

  # Keep only the keys that match an entry in csv_order, in that order
  sorted_header = []
  csv_order.each { |k| all_keys.each { |e| sorted_header << e if e.start_with?(k) } }

  sorted_header.uniq
end
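The nested loop above scans all keys once per entry in the ordering array. One alternative is to turn the ordering array into a lookup hash and sort once. This is only a sketch under assumptions: `sort_headers` is a hypothetical helper, the order list holds base names without the numeric suffix, and keys missing from the order list are placed last rather than dropped, which differs from the method above.

```ruby
# Hypothetical helper: orders flattened header names by a list of base names.
# Assumption: base_order holds names without the suffix ("creator", not "creator_1").
def sort_headers(keys, base_order)
  rank = base_order.each_with_index.to_h # base name => position, O(1) lookup
  keys.sort_by do |name|
    base = name.sub(/_\d+\z/, "")
    suffix = name[/_(\d+)\z/, 1].to_i
    # Names missing from base_order sort after all known ones
    [rank.fetch(base, base_order.length), suffix, name]
  end
end

order = %w[id edition creator issue date_modified]
keys = %w[issue_2 creator_2 date_modified id creator_1 issue_1 edition_1 title_1]
sorted = sort_headers(keys, order)
# sorted is ["id", "edition_1", "creator_1", "creator_2",
#            "issue_1", "issue_2", "date_modified", "title_1"]
```

Building the hash is linear in the length of the order list, and the sort is O(k log k) in the number of unique keys, independent of the number of records.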

Generating the CSV involves yet more loops:

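require 'csv'

def to_csv
  data = csv_data
  sorted_header = csv_header

  # Returns the whole CSV document as a single string
  CSV.generate(headers: true) do |csv|
    csv << sorted_header
    data.each do |hash|
      # values_at yields nil for keys a row lacks, which becomes an empty cell
      csv << hash.values_at(*sorted_header)
    end
  end
end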
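`CSV.generate` builds the whole document as one string in memory. For a data set that keeps growing, writing each row straight to an IO object may be preferable. The following is a rough sketch, assuming the rows are already flattened and the headers already sorted; `write_csv` is a hypothetical helper name.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: writes the header and rows to any IO object, one line at a time.
def write_csv(io, rows, headers)
  csv = CSV.new(io)
  csv << headers
  # values_at returns nil for missing keys, which CSV writes as an empty cell
  rows.each { |row| csv << row.values_at(*headers) }
  io
end

# Made-up sample rows; StringIO stands in for a real file handle
rows = [
  { "id" => "a1", "issue_1" => "nov" },
  { "id" => "b2", "issue_1" => "dec", "issue_2" => "ten" }
]
out = write_csv(StringIO.new, rows, %w[id issue_1 issue_2])
csv_text = out.string
```

The same helper should work with a file opened for writing, for example inside a `File.open(path, "w")` block, so the full CSV never has to be held in memory.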

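To be honest, I was more intrigued to find out whether I could work out your desired logic without further description than by the programming part alone (though of course I enjoyed that as well; it has been ages since I did some Ruby, and this was a good refresher). Since the task is not clearly stated, it has to be "distilled" by reading your description, input data and code.

I think what you should do is keep everything in very basic, lightweight arrays and do the heavy lifting in one single big step while reading the data. I also made the assumption that if a key ends with a number, or if a value is an array, you want it returned as {key}_{n}, even if only one value is present.

So far I have come up with this code (the logic is described in the comments) and a repl demo here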

class CustomData
  # @keys array structure
  # 0: Key
  # 1: Maximum amount of values associated
  # 2: Is an array (Found a {key}_n key in feed,
  #    or value in feed was an array)
  #
  # @data: is a simple array of arrays
  attr_accessor :keys, :data
  CSV_ORDER = %w[
    id edition date_uploaded creator book_title publisher
    file place_of_publication journal_title issue date_modified
  ]

  def initialize(feed)
    @keys = CSV_ORDER.map { |key| [key, 0, false]}
    @data = []
    feed.each do |row|
      new_row = []
      # Sort keys in order to maintain the right order for {key}_{n} values
      row.sort_by { |key, _| key }.each do |key, value|
        is_array = false
        if key =~ /_\d+$/
          # If key ends with a number, extract key
          # and remember it is an array for the output
          key, is_array = key[/^(.*)_\d+$/, 1], true
        end
        if value.is_a? Array
          # If value is an array, even if the key did not end with a number,
          # we remember that for the output
          is_array = true
        else
          value = [value]
        end
        # Find position of key if exists or nil
        key_index = @keys.index { |a| a.first == key }
        if key_index
          # If you could have a combination of _n keys and array values
          # for a key in your feed, you need to change this portion here
          # to account for all previous values, which would add some complexity
          #
          # If current amount of values is greater than the saved one, override
          @keys[key_index][1] = value.length if @keys[key_index][1] < value.length
          @keys[key_index][2] = true if is_array and not @keys[key_index][2]
        else
          # It is a new key in @keys array
          key_index = @keys.length
          @keys << [key, value.length, is_array]
        end
        # Add value array at known key index
        # (will be padded with nil if idx is greater than array size)
        new_row[key_index] = value
      end
      @data << new_row
    end
  end

  def to_csv_data(headers=true)
    # Returns [header, body] when headers is true, otherwise [body]
    result, header, body = [], [], []
    if headers
      @keys.each do |key|
        # Skip CSV_ORDER keys that never appeared in the feed; the body
        # emits no columns for them, so a header would misalign the rows
        next if key[1].zero?
        if key[2]
          # If the key should hold multiple values, build the header string
          key[1].times { |i| header << "#{key[0]}_#{i+1}" }
        else
          # Otherwise it is a singular value and the header goes unmodified
          header << key[0]
        end
      end
      result << header
    end
    @data.each do |row|
      new_row = []
      # Walk @keys rather than the row, so keys missing from this row
      # (nil or absent cells) are still padded with nil values
      @keys.each_with_index do |key, index|
        values = Array(row[index])
        key[1].times { |count| new_row << values[count] }
      end
      body << new_row
    end
    result << body
  end

end
