
Finding common phrases between files with millions of lines

I have two files with the following number of lines:

file1 - 110433003
file2 - 4838810

I need to find the common phrases between these files. Each line is of the form:

p1 ||| p1 ||| p2 ||| p2 ||| .......

The p1 of file1 can be the p2 in file2. Unfortunately, the code I have written is taking way too long to do this.

import sys

if len(sys.argv) < 4:
        print 'usage: python CommonPhrases.py enFr hrEn commonFile'
        sys.exit()

enFr = open(sys.argv[1], 'r')
hrEn = open(sys.argv[2], 'r')
common = open(sys.argv[3], 'w')

# Collect the English phrase (second ||| field) of every hrEn line into a set.
sethrEn = set()
for line in hrEn:
        englishPhrase = line.split(' ||| ')[1]
        sethrEn.add(englishPhrase)

# Check the first ||| field of every enFr line against that set.
for line in enFr:
        englishPhrase = line.split(' ||| ')[0]
        if englishPhrase in sethrEn:
                common.write(englishPhrase + '\n')

Is there a faster way to do this?

Thanks

You definitely need a trie for something like this. It seems like you will be spending most of your time searching the set for a match.

Also, every time I find myself trying to make Python go faster, I switch to PyPy ( http://pypy.org/ ). It is extremely easy to set up (just download the binaries, put them on your path and change #!/usr/bin/env python to #!/usr/bin/env pypy) and gives speedups in the range of 5-10x for such tasks.

For a reference implementation using PyTrie, see below.

#!/usr/bin/env pypy

import sys
sys.path.append('/usr/local/lib/python2.7/dist-packages/PyTrie-0.1-py2.7.egg/')
from pytrie import SortedStringTrie as trie

if len(sys.argv) < 4:
        print 'usage: python CommonPhrases.py enFr hrEn commonFile'
        sys.exit()

enFr = open(sys.argv[1], 'r')
hrEn = open(sys.argv[2], 'r')
common = open(sys.argv[3], 'w')

# Store the English phrase (second ||| field) of every hrEn line as a trie key.
sethrEn = trie()
for line in hrEn:
        englishPhrase = line.strip().split(' ||| ')[1]
        sethrEn[englishPhrase] = None

# Look the first ||| field of every enFr line up in the trie.
for line in enFr:
        englishPhrase = line.strip().split(' ||| ')[0]
        if englishPhrase in sethrEn:
                common.write(englishPhrase + '\n')

Note that it requires minimal changes (4 lines) and that you will need to install PyTrie 0.1. On my Ubuntu system, "sudo easy_install PyTrie" did the trick.

Hope that helps.

This sounds like a tree problem. Maybe these ideas can help you.

Using a tree can help find the common phrases. I think there can be two solutions based on the idea of creating a tree.

A tree, once implemented, will need to store every word of one file (just one file). Then, start reading the second file and search for every word of that file in the tree.
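For illustration, here is a minimal sketch of that idea using nested dicts as a character tree. The file names, the ||| field positions and the helper names are assumptions, not part of the original question.

def add_phrase(root, phrase):
    # Walk down one dict per character, creating nodes as needed.
    node = root
    for ch in phrase:
        node = node.setdefault(ch, {})
    node['$'] = True            # end-of-phrase marker

def has_phrase(root, phrase):
    node = root
    for ch in phrase:
        if ch not in node:
            return False
        node = node[ch]
    return '$' in node

root = {}
for line in open('hrEn.txt'):                      # store every phrase of one file
    add_phrase(root, line.strip().split(' ||| ')[1])

for line in open('enFr.txt'):                      # search every phrase of the other file
    phrase = line.strip().split(' ||| ')[0]
    if has_phrase(root, phrase):
        print(phrase)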

This solution has some problems, of course. Storing a tree in memory for that amount of words (or lines) can need a lot of RAM.

Let's suppose you manage to use a fixed amount of RAM to store the data, so that only the data itself (the characters of the lines) is counted. In the worst-case scenario you will need 255^N bytes, where N is the length of the longest line (supposing that almost every character combination of length N actually occurs). So, storing every possible combination of length 10, you would need about 1.16252367019e+24 bytes of RAM. That is a lot. Remember, this solution (as far as I know) is "fast", but needs more RAM than you can probably find.
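As a quick sanity check of that figure (the 255 and N = 10 are taken directly from the estimate above):

N = 10
print(255 ** N)      # about 1.16e+24 bytes, matching the figure quoted above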

So, the other solution, which is very, very slow, is to read one line of file A and then compare it with every single line of file B. It needs almost no RAM, but it will take so much time that you may not really be able to wait for it.

So, maybe another solution is to divide the problem.

You say you have a list of lines, and we don't know whether they are alphabetically sorted or not. So, maybe you can start reading file A and create new files. Each new file will store, for example, the lines starting with 'A', another the lines starting with 'B', and so on. Then do the same with file B, so that as a result you have two files holding the lines that start with 'A', one from the original A file and another from the original B file. Then compare them with a tree.
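A minimal sketch of that bucketing step, under the same assumptions as before (the input paths, output directory names and bucket file naming are placeholders for illustration):

import os

def split_by_first_char(path, field, out_dir):
    # Write each phrase into a bucket file named after the code of its first character.
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    buckets = {}
    for line in open(path):
        phrase = line.strip().split(' ||| ')[field]
        if not phrase:
            continue
        key = phrase[0]
        if key not in buckets:
            buckets[key] = open(os.path.join(out_dir, 'bucket_%d.txt' % ord(key)), 'w')
        buckets[key].write(phrase + '\n')
    for handle in buckets.values():
        handle.close()

split_by_first_char('enFr.txt', 0, 'bucketsA')
split_by_first_char('hrEn.txt', 1, 'bucketsB')
# Matching bucket pairs (bucketsA/bucket_65.txt vs bucketsB/bucket_65.txt, etc.)
# can now be compared independently, each with a much smaller in-memory tree or set.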

In the best-case scenario, the lines will be divided evenly, letting you use an in-memory tree. In the worst-case scenario, you will end up with only one file, the same as the starting A file, because maybe all lines start with 'A'.

So, maybe you can implement a way to divide the files further if they are still too big: first by the first character of the lines, then dividing the lines starting with 'A' into 'AA', 'AB', 'AC', and so on (deleting the previous division files, of course), until you get files small enough to use a better method to search for the same lines (maybe using an in-memory tree).

This solution can also take a long time, though maybe not as long as the low-RAM option, and it also doesn't need much RAM to work.

These are the solutions I can think of at the moment. Maybe they work, maybe not.
