简体   繁体   中英

Importing a large dataset into a database

I'm a beginning programmer in the relevant areas to this question, so if possible, it'd be helpful to avoid assuming I know a lot already.

I'm trying to import the OpenLibrary dataset into a local Postgres database. After it's imported, I plan to use it as a starting seed for a Ruby on Rails application that will include information on books.

The OpenLibrary datasets are available here, in a modified JSON format: http://openlibrary.org/dev/docs/jsondump

I only need very basic information for my application, much less than what is provided in the dumps. I'm only trying to get out book titles, author names, and relationships between books and authors.

Below are two typical entries from their dataset, the first for an author, and the second for a book (they seem to have an entry for each edition of a book). The entries seem to lead off with a primary key, and then with a type, before including the actual JSON database dump.

/a/OL2A /type/author {"name": "U. Venkatakrishna Rao", "personal_name": "U. Venkatakrishna Rao", "last_modified": {"type": "/type/datetime", "value": "2008-09-10 08:44:01.978456"}, "key": "/a/OL2A", "birth_date": "1904", "type": {"key": "/type/author"}, "id": 99, "revision": 3}

/b/OL345M /type/edition {"publishers": ["Social Science Research Project, Dept. of Geography, University of Dacca"], "pagination": "ii, 54 p.", "title": "Land use in Fayadabad area", "lccn": ["sa 65000491"], "subject_place": ["East Pakistan", "Dacca region."], "number_of_pages": 54, "languages": [{"comment": "initial import", "code": "eng", "name": "English", "key": "/l/eng"}], "lc_classifications": ["S471.P162 E23"], "publish_date": "1963", "publish_country": "pk ", "key": "/b/OL345M", "authors": [{"birth_date": "1911", "name": "Nafis Ahmad", "key": "/a/OL302A", "personal_name": "Nafis Ahmad"}], "publish_places": ["Dacca, East Pakistan"], "by_statement": "[by] Nafis Ahmad and F. Karim Khan.", "oclc_numbers": ["4671066"], "contributions": ["Khan, Fazle Karim, joint author."], "subjects": ["Land use -- East Pakistan -- Dacca region."]}

The size of the uncompressed dumps are enormous, about 2GB for the authors list, and 18GB for the book editions list. OpenLibrary does not provide any tools for this themselves, they provide a simple unoptimized Python script for reading in sample data (which unlike the actual dumps comes in pure JSON format), but they estimate if that was modified for use on their actual data it would take 2 months (!) to finish loading the data.

How can I read this into the database? I assume I'll need to write a program to do this. What language and any guidance on how I should do it to finish in a reasonable amount of time? The only scripting language I have any experience with is Ruby.

It will take two months to download the dump from their website. But it should only take a few hours to import this.

The fastest way will be for you to use Postgres' copy command. You can use that for the author's file. But the editions file needs to be inserted in both the books and author_books tables.

This script is in Python 2.6 but you should be able to adapt to Ruby if needed.

!#/usr/bin/env python
import json

fp = open('editions.json')
ab_out = open('/tmp/author_book.dump', 'w')
b_out = open('/tmp/book.dump', 'w')
for line in fp:
  vals = json.loads(s.split('/type/edition ')[1])
  b_out.write("%(key)s\t%(title)s\t(publish_date)s" % vals)
  for author in vals['authors']:
    ab_out.write("%s\t%s" % (vals['key'], author['key'])
fp.close()
ab_out.close()
b_out.close()

Then to copy to Postgres:

COPY book_table FROM '/tmp/book.dump'

如果TAPS可以在这里为您提供帮助,dunno,请访问http://adam.heroku.com/past/2009/2/11/taps_for_easy_database_transfers/

Following Scott Bailey's advice, I wrote Ruby scripts to modify the JSON into a format acceptable for the Postgres copy command. In case anyone else runs into this same problem, here are the scripts I wrote:

require 'rubygems'
require 'json'

fp = File.open('./edition.txt', 'r')
ab_out = File.new('./author_book.dump', 'w')
b_out = File.new('./book.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["title"].nil?
      next
    end
    title = vals["title"]
    #Some titles contain backslashes and tabs, which we need to escape and remove, respectively
    title.gsub! /\\/, "\\\\\\\\"
    title.gsub! /\t/, " "
    if ((vals["isbn_10"].nil? || vals["isbn_10"].empty?) && (vals["isbn_13"].nil? || vals["isbn_13"].empty?))
      b_out.puts vals["key"] + "\t" + title + "\t" + '\N' + "\n"
    #Only get the first ISBN number
    elsif (!vals["isbn_10"].nil? && !vals["isbn_10"].empty?) 
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_10"][0] + "\n"
    elsif (!vals["isbn_13"].nil? && !vals["isbn_13"].empty?)
      b_out.puts vals["key"] + "\t" + title + "\t" + vals["isbn_13"][0] + "\n"    
    end
    if vals["authors"]
      for author in vals["authors"]
        if !author["key"].nil?
          ab_out.puts vals["key"] + "\t" + author["key"]
        end
      end
    end
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
ab_out.close
b_out.close

and

require 'rubygems'
require 'json'

fp = File.open('./author.txt', 'r')
a_out = File.new('./author.dump', 'w')

i = 0
while (line = fp.gets) 
  i += 1
  start = line.index /\{/
  if start
    to_parse = line[start, line.length]
    vals = JSON.parse to_parse

    if vals["key"].nil? || vals["name"].nil?
      next
    end
    name = vals["name"]
    name.gsub! /\\/, "\\\\\\\\"
    name.gsub! /\t/, " "
    a_out.puts vals["key"] + "\t" + name + "\n"
  else
    puts "Error processing line: " + line.to_s
  end
  if i % 100000 == 0
    puts "Processed line " + i.to_s
  end
end

fp.close
a_out.close

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM