Finding common phrases between files with millions of lines

I have two files with the following numbers of lines:

file1 - 110433003
file2 - 4838810

I need to find the common phrases between them. Each line is of the form:

p1 ||| p2 ||| .......

The p1 of file1 can be the p2 in file2. Unfortunately, the code I have written is taking way too long to do this.

import sys
import os

if len(sys.argv) < 4:
    print 'python CommonPhrases.py enFr hrEn commonFile'
    sys.exit()

enFr = open(sys.argv[1], 'r')
hrEn = open(sys.argv[2], 'r')
common = open(sys.argv[3], 'w')

# Collect the English phrase (second field) of every line of the hrEn file.
sethrEn = set()
setenFr = set()
for line in hrEn:
    englishPhrase = line.split(' ||| ')[1]
    sethrEn.add(englishPhrase)

# For every line of the enFr file, check whether its first field was seen above.
for line in enFr:
    englishPhrase = line.split(' ||| ')[0]
    if englishPhrase in sethrEn:
        common.write(englishPhrase + '\n')

Is there a faster way to do this?

Thanks

You definitely need a trie for something like this. It seems like you will be spending most of your time searching the set for a match.

Also, every time I find myself trying to make Python go faster, I switch to PyPy (http://pypy.org/). It is extremely easy to set up (just download the binaries, put it in your path, and change #!/usr/bin/env python to #!/usr/bin/env pypy) and gives speedups in the range of 5-10x for tasks like this.

For a reference implementation using PyTrie see below.

#!/usr/bin/env pypy

import sys
import os
sys.path.append('/usr/local/lib/python2.7/dist-packages/PyTrie-0.1-py2.7.egg/')
from pytrie import SortedStringTrie as trie

if len(sys.argv) < 4:
    print 'python CommonPhrases.py enFr hrEn commonFile'
    sys.exit()

enFr = open(sys.argv[1], 'r')
hrEn = open(sys.argv[2], 'r')
common = open(sys.argv[3], 'w')

# The trie replaces the plain set; keys are the English phrases of the hrEn file.
sethrEn = trie()

for line in hrEn:
    englishPhrase = line.strip().split(' ||| ')[1]
    sethrEn[englishPhrase] = None

# Look up the first field of every enFr line in the trie.
for line in enFr:
    englishPhrase = line.strip().split(' ||| ')[0]
    if englishPhrase in sethrEn:
        common.write(englishPhrase + '\n')

Note that it requires minimal changes (4 lines) and that you will need to install PyTrie 0.1. On my Ubuntu system, "sudo easy_install PyTrie" did the trick.

Hope that helps.

This sounds like a tree problem. Maybe these ideas can help you.

Using a tree can help you find the common phrases. I can think of two solutions based on the idea of building a tree.

A tree, once implemented, would store every phrase of one file (just one file). Then you start reading the second file and look each of its phrases up in the tree.
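
A minimal sketch of that idea in Python, using a hand-rolled character trie (the class and function names here are only illustrative, not from any particular library):

# A character trie: store every phrase of one file, then probe it
# with the phrases of the other file.
class TrieNode(object):
    def __init__(self):
        self.children = {}   # maps one character to the next TrieNode
        self.is_end = False  # True if a stored phrase ends at this node

def trie_insert(root, phrase):
    node = root
    for ch in phrase:
        node = node.children.setdefault(ch, TrieNode())
    node.is_end = True

def trie_contains(root, phrase):
    node = root
    for ch in phrase:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end

# Tiny usage example.
root = TrieNode()
trie_insert(root, 'the cat')
print(trie_contains(root, 'the cat'))  # True
print(trie_contains(root, 'the dog'))  # False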

This solution has some problems, of course. Storing a tree in memory for that many phrases (or lines) can require a lot of RAM.

Let's suppose you manage to use a fixed amount of RAM per entry, so only the data itself (the characters of the lines) is counted. In the worst case scenario you will need 255^N bytes, where N is the length of the longest line (supposing that you hit almost every possible combination of characters of length N). So, to store every possible combination of length 10, you would need about 1.16252367019e+24 bytes of RAM. That is a lot. Remember, this solution (as far as I know) is "fast", but needs more RAM than you may be able to find.
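
For reference, that worst-case figure can be reproduced with a one-off calculation:

# Worst-case estimate from the paragraph above: 255**N bytes for N = 10.
N = 10
print(float(255 ** N))  # roughly 1.1625e+24 bytes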

So, the other solution, which is very, very slow, is to read one line of file A and then compare it with every single line of file B. It needs almost no RAM, but it will take far too much time, and you may not actually be able to wait for it.
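
A minimal sketch of that brute-force variant (the file names 'fileA' and 'fileB' are placeholders):

# Compare every line of file A against every line of file B.
# Uses almost no memory, but performs len(A) * len(B) comparisons.
with open('fileA') as file_a:
    for line_a in file_a:
        with open('fileB') as file_b:   # re-read B for every line of A
            for line_b in file_b:
                if line_a == line_b:
                    print(line_a.strip())
                    break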

So maybe another solution is to divide the problem.

You say you have a list of lines, and we don't know whether they are alphabetically sorted or not. So maybe you can start reading file A and create new files: one file storing, for example, the lines that start with 'A', another the lines that start with 'B', and so on. Then do the same with file B. As a result you have two files that contain the 'A'-starting lines, one coming from the original file A and the other from the original file B. Then compare them with a tree.

In the best case scenario the lines will be divided evenly, letting you use a tree in memory. In the worst case scenario you will end up with only one file, identical to the original file A, because maybe all the lines start with 'A'.

So maybe you can implement a way to divide the files further if they are still too big: first by the first character of each line; then split the 'A'-starting lines into 'AA', 'AB', 'AC', and so on (deleting the previous division files, of course), until you get files small enough to use a better method to search for the common lines (maybe a tree in memory).
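
A rough sketch of that first split by leading character (the paths are placeholders, and it assumes the leading characters are safe to use in file names):

import os

def split_by_first_char(path, out_dir):
    """Write every line of `path` into out_dir/<first character>.txt."""
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    buckets = {}  # first character -> open bucket file
    with open(path) as src:
        for line in src:
            if not line.strip():
                continue
            key = line[0]  # assumed safe to use as part of a file name
            if key not in buckets:
                buckets[key] = open(os.path.join(out_dir, key + '.txt'), 'w')
            buckets[key].write(line)
    for handle in buckets.values():
        handle.close()

# Split both inputs, then compare bucket 'x' of A only against bucket 'x' of B.
split_by_first_char('fileA', 'bucketsA')
split_by_first_char('fileB', 'bucketsB')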

This solution can also take a long time, but probably not as long as the low-RAM option, and it does not need much RAM to work.

These are the solutions I can think of at the moment. Maybe they will work, maybe not.
