Importing a large dataset into a database

Question

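I'm a beginner programmer in the areas this problem touches on, so if possible please avoid assuming I already know a lot.

I'm trying to import the OpenLibrary dataset into a local Postgres database. Once it's imported, I plan to use it as the starting seed for a Ruby on Rails application that will include information on books.

The OpenLibrary datasets are available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump

My application only needs very basic information, much less than what the dumps provide. I'm just trying to get out the book titles, the author names, and the relationships between books and authors.

Below are two typical entries from their dataset, the first for an author and the second for a book (they seem to have one entry for each edition of a book). Each entry seems to lead with the primary key and then the type, before the actual JSON data.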

/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}

/b/OL345M /type/edition {"publishers": ["Social Science Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}
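
As a minimal sketch of the line layout described above (a key, then a type, then the JSON record), the following Python splits one dump line into those three parts. The function name `parse_dump_line` is made up for illustration, and the code assumes the three fields are separated by whitespace as in the samples shown here; the real dump may use tabs, which this also accepts.

```python
import json

def parse_dump_line(line):
    """Split one dump line into (key, type, record).

    Assumes the layout shown in the samples: key, type, then a JSON
    object, separated by whitespace (spaces or tabs).
    """
    # maxsplit=2 leaves the JSON text intact even though it contains spaces.
    key, record_type, raw_json = line.strip().split(None, 2)
    return key, record_type, json.loads(raw_json)

# Shortened version of the author sample above.
sample = '/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "key": "/a/OL2A", "birth_date": "1904"}'
key, record_type, record = parse_dump_line(sample)
print(key, record_type, record["name"])
```

Splitting on the first two whitespace runs, rather than searching for a fixed string such as `/type/edition`, lets the same function handle both author and edition lines.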

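The uncompressed dumps are enormous: about 2GB for the authors list and 18GB for the book editions list. OpenLibrary doesn't provide any tools for this themselves. They do provide a simple, unoptimized Python script for reading in sample data (which, unlike the actual dumps, is in pure JSON format), but they estimate that if it were modified for use on their actual data it would take 2 months (!) to finish loading.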
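
With files this large, whatever reads them should stream one line at a time rather than load a whole file into memory. A rough sketch in Python, assuming the downloaded dump is gzip-compressed (the question does not say which compression is used) and using the hypothetical helper names `iter_dump_lines` and `count_lines`:

```python
import gzip

def iter_dump_lines(path):
    """Yield lines lazily from a plain-text or gzip-compressed dump file."""
    # Assumption: compressed dumps end in ".gz"; anything else is read as plain text.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fp:
        for line in fp:
            yield line.rstrip("\n")

def count_lines(path, report_every=100000):
    """Count records while printing progress, holding only one line in memory."""
    total = 0
    for total, _ in enumerate(iter_dump_lines(path), start=1):
        if total % report_every == 0:
            print("Processed line", total)
    return total
```

Reading the compressed file directly also avoids keeping a roughly 18GB uncompressed copy on disk.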

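How can I read this into the database? I assume I'll need to write a program to do it. Which language should I use, and how should I go about it, so that it finishes in a reasonable amount of time? The only scripting language I have experience with is Ruby.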

Answer 1

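Downloading the dump from their site will take 2 months. Importing it, though, should only take a few hours.

The fastest way is to use Postgres's COPY command. You can use it directly on the authors file. The editions file, however, needs to be inserted into both the books table and the author_books table.

This script is in Python 2.6, but you should be able to adapt it to Ruby if you need to.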

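#!/usr/bin/env python
import json

fp = open('editions.json')
ab_out = open('/tmp/author_book.dump', 'w')
b_out = open('/tmp/book.dump', 'w')
for line in fp:
  # Each line looks like "<key> /type/edition <json>"; parse only the JSON part.
  vals = json.loads(line.split('/type/edition ')[1])
  # One tab-separated row per edition; COPY expects a newline after each row.
  # Assumes every record has key, title and publish_date.
  b_out.write("%(key)s\t%(title)s\t%(publish_date)s\n" % vals)
  # Not every edition lists authors, so fall back to an empty list.
  for author in vals.get('authors', []):
    ab_out.write("%s\t%s\n" % (vals['key'], author['key']))
fp.close()
ab_out.close()
b_out.close()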

Then copy into Postgres:

COPY book_table FROM '/tmp/book.dump'
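
One detail worth sketching: in COPY's default text format, a tab separates columns, a newline ends a row, a backslash starts an escape sequence, and `\N` stands for NULL. Field values that contain those characters therefore need escaping before they are written to the dump file. The helpers below show one way to do that in Python; `copy_escape` and `copy_row` are illustrative names, not part of any library.

```python
def copy_escape(value):
    """Escape one field for PostgreSQL COPY text format; None becomes \\N (NULL)."""
    if value is None:
        return "\\N"
    text = str(value)
    # Backslash must be escaped first so the escapes added below are not doubled.
    text = text.replace("\\", "\\\\")
    return text.replace("\t", "\\t").replace("\n", "\\n").replace("\r", "\\r")

def copy_row(*fields):
    """Join escaped fields with tabs and terminate the row with a newline."""
    return "\t".join(copy_escape(f) for f in fields) + "\n"

# A title containing a tab and a backslash, plus a missing publish date.
print(repr(copy_row("/b/OL345M", "Land\tuse \\ survey", None)))
```

Escaping tabs and newlines instead of dropping them keeps the loaded values identical to the source data.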

Answer 2

Dunno if TAPS can help you here, but have a look: http://adam.heroku.com/past/2009/2/11/taps_for_easy_database_transfers/

Answer 3

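Following Scott Bailey's advice, I wrote Ruby scripts to convert the JSON into a format that the Postgres COPY command accepts. In case anyone else runs into the same problem, here are the scripts I wrote: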

require 'rubygems'
require 'json'

fp = File.open('./edition.txt', 'r')
ab_out = File.new('./author_book.dump', 'w')
b_out = File.new('./book.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["title"].nil?
      next
    end
    title = vals["title"]
    #Some titles contain backslashes and tabs, which we need to escape and remove, respectively
    title.gsub! /\\/, "\\\\\\\\"
    title.gsub! /\t/, " "
    if ((vals["isbn_10"].nil? || vals["isbn_10"].empty?) && (vals["isbn_13"].nil? || vals["isbn_13"].empty?))
      b_out.puts vals["key"] + "\t" + title + "\t" + '\N' + "\n"
    #Only get the first ISBN number
    elsif (!vals["isbn_10"].nil? && !vals["isbn_10"].empty?) 
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_10"][0] + "\n"
    elsif (!vals["isbn_13"].nil? && !vals["isbn_13"].empty?)
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_13"][0] + "\n"    
    end
    if vals["authors"]
      for author in vals["authors"]
        if !author["key"].nil?
          ab_out.puts vals["key"] + "\t" + author["key"]
        end
      end
    end
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
ab_out.close
b_out.close

and

require 'rubygems'
require 'json'

fp = File.open('./author.txt', 'r')
a_out = File.new('./author.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["name"].nil?
      next
    end
    name = vals["name"]
    name.gsub! /\\/, "\\\\\\\\"
    name.gsub! /\t/, " "
    a_out.puts vals["key"] + "\t" + name + "\n"
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
a_out.close

Answer 1: score 1, accepted, 2010-03-16 00:00:06
Answer 2: score 0, 2010-03-15 20:31:29
Answer 3: score 0, 2010-03-17 05:58:32
