Pairing items from two large lists in Python

Question

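I have two files:

file1 (200 million lines), format: email:hash1:hash2
file2 (90 million lines), format: hash:plaintext

What I want to do is replace the hashes (1 or 2) in file1 with the plaintext from file2. I tried the solution from a question I asked here earlier about faster comparison of two lists in Python (the actual code is pasted below), but unfortunately it is not fast enough for data sets this large. It works for smaller files (a small number of lines) but not for larger ones.

Do you have any suggestions for processing these two files faster?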

Edit: the source code mentioned above

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys, os

def banner():
    print('\n%s v 1.0\nby d2@tdhack.com\n' % sys.argv[0])

def getlength(fname):
    return sum(1 for line in open(fname))

def ifexist(fname):
    if not os.path.isfile(fname):
        banner()
        print('[-] %s must exist' % fname)
        sys.exit(1)

def replace(l, X, Y):
  for i,v in enumerate(l):
     if v == X:
        l.pop(i)
        l.insert(i, Y)

if len(sys.argv) < 3:
    banner()
    print('[-] please provide CRACKED and HASHES files')
    sys.exit(1)

CRACKED=sys.argv[1]
HASHES=sys.argv[2]

ifexist(CRACKED)
ifexist(HASHES)

banner()
print('[i] preparing lists from "%s" [%d lines] and "%s" [%d lines]' %(CRACKED, getlength(CRACKED), HASHES, getlength(HASHES)))
with open(CRACKED) as crackedfile:
    cracked = dict(map(str, line.split(':', 1)) for line in crackedfile if ':' in line)

hashdata = [line.rstrip('\n') for line in open(HASHES)]

print('[i] pairing items, this will take a while so please be patient')
for item in hashdata:
    if item in cracked:
        replace(hashdata, item, item+':'+cracked[item].strip('\n'))

print('[i] writing changes')
fout = open(HASHES+'_paired', 'w')
for item in hashdata:
    fout.write(item+'\n')
fout.close()

print('[+] done, now check "%s" [%d lines] file for results.' % (HASHES+'_paired', getlength(HASHES+'_paired')))
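A note on why the loop above slows down on large inputs: `replace()` rescans the whole `hashdata` list for every matched item, so the work grows roughly with the square of the number of lines. Below is a minimal sketch of a single-pass variant, assuming the cracked file still fits in memory as a dict. The `pair_files` function and its parameter names are hypothetical and not part of the original script.

```python
def pair_files(cracked_path, hashes_path, out_path):
    """Append the plaintext to every hash line that has a cracked entry."""
    # Build the lookup table once: hash -> plaintext
    cracked = {}
    with open(cracked_path) as crackedfile:
        for line in crackedfile:
            if ':' in line:
                hash_value, plaintext = line.rstrip('\n').split(':', 1)
                cracked[hash_value] = plaintext

    # Stream the hashes file and write each line as soon as it is resolved,
    # so no list is scanned or modified in place.
    paired = 0
    with open(hashes_path) as hashfile, open(out_path, 'w') as fout:
        for line in hashfile:
            item = line.rstrip('\n')
            if item in cracked:
                item = item + ':' + cracked[item]
                paired += 1
            fout.write(item + '\n')
    return paired
```

Each input line is looked up once in the dict, so the running time is roughly proportional to the combined size of the two files. Memory for the dict remains the limiting factor at 90 million entries.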

Answer 1

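With this many keys, I strongly recommend using some kind of database from Python for this task. With an SQL database, you would have two tables like the following: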

emails_and_hashes

column_name | column_type
----------- | ------------
email       | VARCHAR(255)
hash1       | VARCHAR(255)
hash2       | VARCHAR(255)

with an index on hash1 and an index on hash2.

hash_to_plaintext

column_name | column_type
----------- | ------------
hash        | VARCHAR(255)
plaintext   | TEXT

with an index on hash.
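As an illustration of the layout above, here is a minimal sketch that creates both tables and their indexes. It uses Python's built-in sqlite3 module rather than MySQL purely so that it runs without a server. The index names and the `create_schema` helper are assumptions made for this example.

```python
import sqlite3

def create_schema(con):
    """Create the two tables and the indexes described above."""
    con.executescript("""
        CREATE TABLE IF NOT EXISTS emails_and_hashes (
            email VARCHAR(255),
            hash1 VARCHAR(255),
            hash2 VARCHAR(255)
        );
        CREATE TABLE IF NOT EXISTS hash_to_plaintext (
            hash      VARCHAR(255),
            plaintext TEXT
        );
        -- index names below are hypothetical
        CREATE INDEX IF NOT EXISTS idx_hash1 ON emails_and_hashes (hash1);
        CREATE INDEX IF NOT EXISTS idx_hash2 ON emails_and_hashes (hash2);
        CREATE INDEX IF NOT EXISTS idx_hash  ON hash_to_plaintext (hash);
    """)

# An in-memory database keeps the sketch self-contained
con = sqlite3.connect(":memory:")
create_schema(con)
```

The index on `hash` is the one that matters most for the lookups, since every hash from file1 is searched for in `hash_to_plaintext`.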

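Then use a Python DB connector to iterate over the two tables and update their records from Python. This should be much faster than trying to process hundreds of millions of records in a dict. With this table setup, a MySQL database, and the Python MySQL connector library, you could use code similar to the following (you may need to make some adjustments; this is not an exact answer):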

import mysql.connector

con = mysql.connector.connect(user='your_user', password='your_password', database='your_database', host='your_host')
cur = con.cursor(dictionary=True) # 'dictionary=True' is my preference

def lookup(hash_value):
    # return the plaintext for a hash, or the hash itself if it has no known plaintext
    # note the trailing comma: the parameters must be passed as a tuple
    cur.execute("SELECT plaintext FROM hash_to_plaintext WHERE hash = %s", (hash_value,))
    rows = cur.fetchall()
    return rows[0]['plaintext'] if rows else hash_value

# open your file with emails and hashes
# (hash_to_plaintext is assumed to be loaded from file2 already)
with open('/path/to/file1', 'r') as f:
    for line in f:
        email, hash1, hash2 = line.rstrip('\n').split(':')
        # store the plaintext in place of each hash when one is known
        cur.execute("INSERT INTO emails_and_hashes VALUES (%s, %s, %s)", (email, lookup(hash1), lookup(hash2)))

con.commit()
con.close()
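Once both files are loaded into the tables, the pairing can also be expressed as one SQL join instead of two SELECT statements per line. The following is a sketch of that idea under some assumptions: it uses sqlite3 instead of MySQL so that it is self-contained, the sample rows are made up, and a hash with no known plaintext is left unchanged.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE emails_and_hashes (email TEXT, hash1 TEXT, hash2 TEXT);
    CREATE TABLE hash_to_plaintext (hash TEXT, plaintext TEXT);
    CREATE INDEX idx_hash ON hash_to_plaintext (hash);
""")

# Made-up sample rows standing in for file1 and file2
con.executemany("INSERT INTO emails_and_hashes VALUES (?, ?, ?)",
                [("a@example.com", "h1", "h2"), ("b@example.com", "h3", "h4")])
con.executemany("INSERT INTO hash_to_plaintext VALUES (?, ?)",
                [("h1", "apple"), ("h4", "pear")])

# LEFT JOIN keeps rows whose hashes are unknown; COALESCE falls back to the hash
query = """
    SELECT e.email,
           COALESCE(p1.plaintext, e.hash1),
           COALESCE(p2.plaintext, e.hash2)
    FROM emails_and_hashes AS e
    LEFT JOIN hash_to_plaintext AS p1 ON p1.hash = e.hash1
    LEFT JOIN hash_to_plaintext AS p2 ON p2.hash = e.hash2
    ORDER BY e.email
"""
paired = [":".join(row) for row in con.execute(query)]
```

Letting the database do the join avoids a network round trip per line, which may matter more than anything else at 200 million rows. Whether it actually beats the per-line approach depends on the database and its configuration.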

Answer 2

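After a day of thinking about it, I came up with the idea of using a trie.

A trie lets you store the dict of repetitive hashes in a much more efficient container and look them up at the same cost.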
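To make the idea concrete, here is a toy trie built from nested dicts. It only illustrates how keys with a common prefix share nodes. The `ToyTrie` class and its `_END` marker are hypothetical names for this sketch, and a plain-Python structure like this would not save memory at the scale of the question.

```python
class ToyTrie:
    """Illustrative trie: one nested dict per character, value stored at the end."""
    _END = object()  # sentinel key marking that a complete key ends at this node

    def __init__(self):
        self.root = {}

    def insert(self, key, value):
        # Walk down one level per character, creating nodes as needed
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node[self._END] = value

    def get(self, key, default=None):
        # Follow the characters of the key; give up as soon as a node is missing
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return default
        return node.get(self._END, default)

trie = ToyTrie()
trie.insert("ab12", "apple")
trie.insert("ab34", "pear")
```

Both keys share the nodes for the prefix "ab", so that prefix is stored only once. A lookup costs one step per character of the key, regardless of how many keys are stored.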

A good trie implementation called marisa-trie is available on PyPI.

Here is an idea of how to implement it:

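import marisa_trie

def pairs(path):
    # BytesTrie expects str keys and bytes values; hashes are assumed to be ASCII
    with open(path, "rb") as myfile:
        for line in myfile:
            hash_, _, plaintext = line.rstrip(b"\n").partition(b":")
            yield hash_.decode("ascii"), plaintext

trie = marisa_trie.BytesTrie(pairs("file2"))

def resolve(hash_):
    # trie[key] returns a list of values; keep the hash if it has no plaintext
    key = hash_.decode("ascii")
    return trie[key][0] if key in trie else hash_

with open("file1", "rb") as input_file, open("modified_file1", "wb") as output_file:
    for line in input_file:
        email, hash1, hash2 = line.rstrip(b"\n").split(b":")
        output_file.write(b":".join([email, resolve(hash1), resolve(hash2)]) + b"\n")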

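This should be fast, and it can be up to 50-100 times more memory efficient than a dict.

You can also save the built trie, so you don't need to recreate it every time: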

trie.save('my_hashes.trie')

And load it:

trie = marisa_trie.BytesTrie()
trie.load('my_hashes.trie')
