How can I match relating values in two database tables?

Question

To simplify my problem, let's say I have a table with a lot of books and their respective content, and a separate keyword table. I would like to find the matching pairs, that is, which keywords occur in which book's content. The simple Perl script below illustrates the problem.

#title => content
%books = (
    "Foodworld" => "Cheeseburgers and Hamburgers are the best you can ...",
    "Marvelous Salad" => "Russian dressing is superb when ...",
    "Delicious Steaks" => "Only BBQ RipEye"
);

#id => keyword
%keywords = (
    "1234" => "Cheeseburgers",
    "2345" => "dressing",
    "9789" => "Hamburgers"
);

while ( my ($title,$content) = each %books ) {
  while ( my ($keywordID, $keyword) = each %keywords ) {
    if ( $content =~ /$keyword/ ) {
      print "$title \t $keywordID \n";
    }
  }
}

The output will be (the row order may vary, since Perl hashes are unordered):

Marvelous Salad  2345
Foodworld        1234
Foodworld        9789

My problem is that the collection of books contains ~70,000 titles and the list of keywords ~30,000 words, so the nested loop above would mean roughly 2.1 billion regex tests. Both are in separate tables on a MySQL server. Any suggestions? How would you solve this task? Could you just point me in a good direction?

Answer 1

At first blush this sounds like you want to create a junction table relating books to key_words. In fact you might want to create two junction tables --- one relating titles to key_words and the other relating contents to key_words.

A junction table simply consists of a pair of columns, each declared as a foreign key (FOREIGN KEY ... REFERENCES): one for the "book" ID and the other for the "key_word" ID.

You'd still need to perform the nested loops to create these junction key references, and that table could be huge (in the worst case, a row for every combination of key_word and title/contents). But queries could be quite fast.

You'd have roughly three types of simple queries through either of these junction tables. One finds all books containing a given key_word, another finds all key_words associated with a given book and the last tells you if a given key_word/book combination exists.
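The three lookups just described can be modelled in memory, as a minimal sketch rather than the real table. It assumes the matching pairs from the question's sample output and, because the sample books carry no numeric ID, uses the title as a stand-in book ID; the sub names are made up for illustration.

```perl
use strict;
use warnings;

# Rows of the would-be junction table: [ book, keyword ID ].
# Titles stand in for book IDs (an assumption; the question gives none).
my @junction = (
    [ 'Foodworld',       1234 ],
    [ 'Foodworld',       9789 ],
    [ 'Marvelous Salad', 2345 ],
);

# Index the pairs in both directions. Hash keys cannot repeat, which
# mimics the uniqueness a composite key would enforce.
my ( %keywords_of, %books_with );
for my $row (@junction) {
    my ( $book, $kw ) = @$row;
    $keywords_of{$book}{$kw} = 1;
    $books_with{$kw}{$book}  = 1;
}

# 1. All books containing a given keyword.
sub books_for_keyword {
    my ($kw) = @_;
    return sort keys %{ $books_with{$kw} // {} };
}

# 2. All keywords associated with a given book.
sub keywords_for_book {
    my ($book) = @_;
    return sort keys %{ $keywords_of{$book} // {} };
}

# 3. Does a given book/keyword combination exist?
sub pair_exists {
    my ( $book, $kw ) = @_;
    return ( exists $keywords_of{$book} && exists $keywords_of{$book}{$kw} ) ? 1 : 0;
}

print join( ", ", books_for_keyword(2345) ), "\n";
print join( ", ", keywords_for_book('Foodworld') ), "\n";
print pair_exists( 'Delicious Steaks', 1234 ), "\n";
```

In the database each of these would be a single indexed SELECT against the junction table; the hashes only illustrate the shape of the queries.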

(Other, more complex queries could find things like intersections and set differences of books and key_words, for example all books that contain references to both "dolphin(s)" and "pet(s)". Further considerations would also apply to word stemming, and you might want to use a library to normalize words into their stems.)

Junction tables normally have a composite primary key on both of their columns (and normally don't have a surrogate key of their own). This implicitly creates an index while also imposing a UNIQUE constraint on that composite. The foreign key (REFERENCES) clauses also ensure referential integrity of the associations, which implies that you have to create the book/title and key_word entries before you can create any associations. (Further, any deletion of these entities would necessitate either removing all junction entries first or using a CASCADE option in the DDL.)
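As a rough sketch of what such a table definition might look like in MySQL, held in a Perl string to stay in the question's language: the table and column names (`books`, `keywords`, `book_keyword`, `id`) are assumptions, as is the choice of integer IDs and the InnoDB engine. The script only prints the DDL, so it runs without a database.

```perl
use strict;
use warnings;

# Hypothetical schema: tables `books` and `keywords` already exist,
# each with an INT primary key column named `id`.
my $ddl = <<'SQL';
CREATE TABLE book_keyword (
    book_id    INT NOT NULL,
    keyword_id INT NOT NULL,
    PRIMARY KEY (book_id, keyword_id),
    FOREIGN KEY (book_id)    REFERENCES books (id)    ON DELETE CASCADE,
    FOREIGN KEY (keyword_id) REFERENCES keywords (id) ON DELETE CASCADE
) ENGINE=InnoDB
SQL

# With a connected DBI handle this would be run as $dbh->do($ddl);
# here it is only printed.
print $ddl;
```

The composite PRIMARY KEY supplies both the index and the uniqueness of each pair, and ON DELETE CASCADE is one of the two deletion strategies mentioned above.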

Answer 2

Algorithmically, I can't see any shortcuts - you've got to check each title for each keyword, and so the two loops you've got are about the only way to do that.

What I'd offer as a way to speed the process is that you can compile regular expressions, and it's worth doing in the scenario you've got.

Perl normally compiles a static regex once, but if it contains a variable, it has to recompile whenever that variable's value changes, which in your inner loop is every iteration. You can, however, precompile with qr//; see:

Is there a way to precompile a regex in Perl?

which'll improve things somewhat. You might find something like:

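# Join the keyword strings (the hash values, not the ID keys),
# escaping any regex metacharacters they contain.
my $regex = join ( "|", map { quotemeta } values %keywords );
$regex = qr/$regex/;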

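It may be more efficient to make one compiled RE of 30,000 words than to test each keyword individually. You'd need to test it yourself to check, though (Devel::NYTProf might help). Note that a single alternation only tells you that some keyword matched; to recover the keywordID you would have to capture the matched text and look it up.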
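Here is a minimal sketch of the single-alternation idea that also maps each hit back to its keyword ID. It uses the sample data from the question (with "Cheeseburgers" spelled the same in both hashes) and assumes that keyword strings are unique; `%id_for` and `@pairs` are names introduced here for illustration.

```perl
use strict;
use warnings;

my %books = (
    "Foodworld"        => "Cheeseburgers and Hamburgers are the best you can ...",
    "Marvelous Salad"  => "Russian dressing is superb when ...",
    "Delicious Steaks" => "Only BBQ RipEye",
);
my %keywords = (
    "1234" => "Cheeseburgers",
    "2345" => "dressing",
    "9789" => "Hamburgers",
);

# Reverse map: keyword text => ID (assumes no keyword appears twice).
my %id_for = reverse %keywords;

# Longest keywords first, so a keyword that is a prefix of another
# cannot win the alternation ahead of the longer one.
my $alternation = join "|",
    map  { quotemeta }
    sort { length($b) <=> length($a) || $a cmp $b }
    keys %id_for;
my $regex = qr/($alternation)/;    # compiled once, reused for every book

my @pairs;
for my $title ( sort keys %books ) {
    my %seen;                      # report each keyword once per book
    while ( $books{$title} =~ /$regex/g ) {
        my $id = $id_for{$1};
        push @pairs, "$title\t$id" unless $seen{$id}++;
    }
}
print "$_\n" for @pairs;
```

One caveat with this approach: the scan reports non-overlapping matches only, so a keyword that occurs solely inside another matched keyword (say "burgers" inside "Cheeseburgers") would be missed, whereas the nested loop would find it. Whether that matters depends on the keyword list.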

I would also suggest some caution with memory. The way your code is written, the complete contents of a book is loaded into $content. You want to avoid doing that with more than one book at a time, which it looks like you are doing, since every book sits in %books. So be cautious about bulk-fetching all your books from your DB: assuming $content is fairly large, fetch one book, check it, then fetch the next.

I would add that this problem should parallelise well, as there are no data dependencies between books. You could probably use threading or forking in Perl to parallelise. But be cautious, as DBI isn't thread safe (or at least, not necessarily).
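A forking version might look roughly like the sketch below, which uses only core Perl (`fork`, `pipe`, `waitpid`). It works on the question's in-memory sample data instead of the database; the worker count and the round-robin split of titles are arbitrary choices made for the illustration. In a real run each child would presumably open its own database connection after the fork rather than share the parent's handle.

```perl
use strict;
use warnings;

my %books = (
    "Foodworld"        => "Cheeseburgers and Hamburgers are the best you can ...",
    "Marvelous Salad"  => "Russian dressing is superb when ...",
    "Delicious Steaks" => "Only BBQ RipEye",
);
my %keywords = (
    "1234" => "Cheeseburgers",
    "2345" => "dressing",
    "9789" => "Hamburgers",
);

my $workers = 2;                   # hypothetical worker count
my @titles  = sort keys %books;

my @children;
for my $w ( 0 .. $workers - 1 ) {
    pipe( my $reader, my $writer ) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ( $pid == 0 ) {             # child: handle every $workers-th title
        close $reader;
        for my $i ( grep { $_ % $workers == $w } 0 .. $#titles ) {
            my $title = $titles[$i];
            for my $id ( sort keys %keywords ) {
                print {$writer} "$title\t$id\n"
                    if $books{$title} =~ /\Q$keywords{$id}\E/;
            }
        }
        close $writer;             # flushes the results to the parent
        exit 0;
    }

    close $writer;                 # parent keeps only the read end
    push @children, [ $pid, $reader ];
}

# Parent: collect each child's lines, then reap the child.
my @pairs;
for my $child (@children) {
    my ( $pid, $reader ) = @$child;
    push @pairs, <$reader>;
    close $reader;
    waitpid( $pid, 0 );
}
chomp @pairs;
print "$_\n" for sort @pairs;
```

Because the books are independent of each other, the children never need to communicate; the parent just merges their output.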

2 answers

solution1
2 2015-02-24 21:42:58

solution2
0 ACCEPTED 2015-02-24 21:20:20
