Importing a large dataset into a database

Question

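I'm a novice programmer in the areas this question touches on, so if possible, please avoid assuming I already know a lot.

I'm trying to import the OpenLibrary dataset into a local Postgres database. Once it is imported, I plan to use it as the starting seed for a Ruby on Rails application that will contain information about books.

The OpenLibrary datasets are available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump

My application only needs very basic information, far less than the dumps provide. I just want to extract book titles, author names, and the relationships between books and authors.

Below are two typical entries from their dataset: the first is for an author, the second for a book (they seem to have one entry per edition of a book). Each entry appears to start with a primary key, then a type, followed by the actual JSON record.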

/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}

/b/OL345M /type/edition {"publishers": ["Social Science Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54 p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}
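
The entry layout described above (a key, then a type, then a JSON record) can be taken apart with very little code. Here is a minimal sketch in Python, assuming the key and type are separated by whitespace and the JSON record starts at the first `{` on the line. The function name `parse_dump_line` and the shortened sample line are made up for illustration and are not part of any OpenLibrary tool.

```python
import json


def parse_dump_line(line):
    """Split one dump line into (key, record_type, record_dict)."""
    # Assumption: the JSON payload begins at the first opening brace.
    start = line.index('{')
    # Everything before the brace is "<key> <type>", whitespace-separated.
    key, record_type = line[:start].split()[:2]
    return key, record_type, json.loads(line[start:])


# Shortened version of the author entry shown above.
sample = ('/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", '
          '"key": "/a/OL2A", "birth_date": "1904"}')
key, record_type, record = parse_dump_line(sample)
print(key, record_type, record["name"])
```

Only the fields the application needs (here `name`, and `title` and `authors` for editions) have to be read from the resulting dictionary; the rest can be ignored.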

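The uncompressed dumps are huge: about 2GB for the author list and 18GB for the book edition list. OpenLibrary does not provide any tools for this themselves. They do provide a simple, unoptimized Python script for reading in sample data (which, unlike the actual dumps, comes in pure JSON format), but they estimate that if it were modified for use on their actual data, it would take 2 months (!) to finish loading the data.

How can I read this into the database? I assume I'll need to write a program to do it. Which language should I use, and how should I go about it, so that it finishes in a reasonable amount of time? The only scripting language I have experience with is Ruby.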

Answer 1

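Downloading the dump from their site will take two months. But importing it should only take a few hours.

The fastest way is to use Postgres's COPY command. You can use it directly on the authors file. The editions file, however, needs to be inserted into both the book and author_books tables.

This script is in Python 2.6, but you should be able to adapt it to Ruby if needed.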

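#!/usr/bin/env python
# Convert the editions dump into tab-separated files for COPY.
import io
import json

fp = io.open('editions.json', encoding='utf-8')
ab_out = io.open('/tmp/author_book.dump', 'w', encoding='utf-8')
b_out = io.open('/tmp/book.dump', 'w', encoding='utf-8')
for line in fp:
  # Each line is "<key> /type/edition <json>"; parse only the JSON part.
  vals = json.loads(line.split('/type/edition ')[1])
  # One row per edition: key, title, publish date (\N is COPY's NULL).
  b_out.write("%s\t%s\t%s\n" % (
      vals['key'], vals.get('title', ''), vals.get('publish_date', '\\N')))
  # One row per (edition, author) pair; some editions list no authors.
  for author in vals.get('authors', []):
    ab_out.write("%s\t%s\n" % (vals['key'], author['key']))
fp.close()
ab_out.close()
b_out.close()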

Then copy it into Postgres:

COPY book_table FROM '/tmp/book.dump'
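
One thing to watch when producing files for COPY: in its default text format, tab is the column delimiter, newline ends a row, backslash is the escape character, and `\N` stands for NULL. A title containing any of these will break the load unless it is escaped first. The following is a rough sketch of such escaping in Python; the helper names `copy_escape` and `copy_row` are hypothetical, not part of Postgres or any library.

```python
def copy_escape(value):
    """Escape one field for COPY's default text format."""
    if value is None:
        return '\\N'  # COPY's marker for NULL
    return (str(value)
            .replace('\\', '\\\\')  # backslashes first, so later escapes survive
            .replace('\t', '\\t')
            .replace('\n', '\\n')
            .replace('\r', '\\r'))


def copy_row(*fields):
    """Join escaped fields into one tab-separated, newline-terminated row."""
    return '\t'.join(copy_escape(f) for f in fields) + '\n'


# A title with an embedded tab and a missing publish date.
row = copy_row('/b/OL345M', 'Land use\tin Fayadabad area', None)
print(repr(row))
```

Escaping the backslash before the other characters matters: doing it last would double the backslashes that the tab and newline replacements had just introduced.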

Answer 2

Dunno if TAPS can help you here, but take a look at http://adam.heroku.com/past/2009/2/11/taps_for_easy_database_transfers/

Answer 3

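Following Scott Bailey's advice, I wrote Ruby scripts to convert the JSON into a format the Postgres COPY command accepts. In case anyone else runs into the same problem, here are the scripts I wrote: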

require 'rubygems'
require 'json'

fp = File.open('./edition.txt', 'r')
ab_out = File.new('./author_book.dump', 'w')
b_out = File.new('./book.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["title"].nil?
      next
    end
    title = vals["title"]
    #Some titles contain backslashes and tabs, which we need to escape and remove, respectively
    title.gsub! /\\/, "\\\\\\\\"
    title.gsub! /\t/, " "
    if ((vals["isbn_10"].nil? || vals["isbn_10"].empty?) && (vals["isbn_13"].nil? || vals["isbn_13"].empty?))
      b_out.puts vals["key"] + "\t" + title + "\t" + '\N' + "\n"
    #Only get the first ISBN number
    elsif (!vals["isbn_10"].nil? && !vals["isbn_10"].empty?) 
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_10"][0] + "\n"
    elsif (!vals["isbn_13"].nil? && !vals["isbn_13"].empty?)
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_13"][0] + "\n"    
    end
    if vals["authors"]
      for author in vals["authors"]
        if !author["key"].nil?
          ab_out.puts vals["key"] + "\t" + author["key"]
        end
      end
    end
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
ab_out.close
b_out.close

and

require 'rubygems'
require 'json'

fp = File.open('./author.txt', 'r')
a_out = File.new('./author.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["name"].nil?
      next
    end
    name = vals["name"]
    name.gsub! /\\/, "\\\\\\\\"
    name.gsub! /\t/, " "
    a_out.puts vals["key"] + "\t" + name + "\n"
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
a_out.close

Answer 1: score 1, accepted, 2010-03-16 00:00:06
Answer 2: score 0, 2010-03-15 20:31:29
Answer 3: score 0, 2010-03-17 05:58:32
