Efficient multiprocessing/multithreading with large files

I have two large datasets full of hashes that I need to run comparisons against:

sample1 (roughly 15 GB):

    0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C
    00000142988AFA836117B1B572FAE4713F200567
    000001BCBC3B7C8C6E5FC59B686D3568132D218C
    000001E4975FA18878DF5C0989024327FBE1F4DF

sample2 (roughly 5 GB):

    0000002D9D62AEBE1E0E9DB6C4C4C7C16A163D2C
    00000142988AFA836117B1B572FAE4713F200567
    000001BCBC3B7C8C6E5FC59B686D3568132D218C
    000001E4975FA18878DF5C0989024327FBE1F4DF

I am currently trying to use multiprocessing so that each file in a directory is compared against both of these files, like below:

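    import multiprocessing
    import glob2
    import pandas as pd

    # compare_function(file, sample1, sample2) is defined elsewhere.
    if __name__ == '__main__':
        hash_path = glob2.glob(r'pathtohashes*.csv')
        sample1 = pd.read_csv(r'pathtosample1hashes.csv', names=['hash'])
        sample2 = pd.read_csv(r'pathtosample2hashes.csv', names=['hash'])
        jobs = []  # created once, so every started process is tracked
        for file in hash_path:
            # Pass the callable and its arguments; calling it here would run it in the parent.
            p = multiprocessing.Process(target=compare_function, args=(file, sample1, sample2))
            jobs.append(p)
            p.start()
        for p in jobs:
            p.join()  # wait for all workers to finish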

The function compares the file against both sample files and writes its output to a directory.

How can I make this more efficient? I feel as though I have too many processes, each holding the full dataset in memory, when I could keep a single copy in memory and just reference it, but I am unsure how to do so. Any tips on how to make this more efficient would be helpful. Thank you for your assistance.

You might want to look into using standard Unix tools. If you are trying to find common or missing items, be aware of the comm and join commands. They are purpose-built, in C, for exactly this.
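For reference, `comm -12 a b` prints only the lines common to both files, and it expects both inputs to be sorted. If you would rather stay in Python, the same idea can be sketched as a streaming merge that reads both files line by line and never holds either one in memory. This is only a rough illustration, assuming each file has one hash per line and is already sorted in plain byte order (as the samples above appear to be). The name `common_lines`, the file names, and the shortened hashes in the demo are made up for the example.

```python
import os
import tempfile


def common_lines(path_a, path_b):
    """Yield lines present in both files (roughly what `comm -12` does).

    Assumes both files are sorted and contain one hash per line.
    Only one line from each file is held in memory at a time.
    """
    with open(path_a) as fa, open(path_b) as fb:
        a = fa.readline()
        b = fb.readline()
        while a and b:  # readline() returns "" at end of file
            ka, kb = a.rstrip("\n"), b.rstrip("\n")
            if ka == kb:
                yield ka
                a = fa.readline()
                b = fb.readline()
            elif ka < kb:
                a = fa.readline()  # ka cannot appear later in b
            else:
                b = fb.readline()  # kb cannot appear later in a


def write_lines(path, lines):
    """Helper for the demo: write one item per line."""
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


# Demo with tiny, made-up (shortened) hashes in temporary files.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    candidate = os.path.join(tmp, "candidate.txt")
    write_lines(sample, ["0000002D", "00000142", "000001BC", "000001E4"])
    write_lines(candidate, ["00000142", "000001AA", "000001E4"])
    matches = list(common_lines(sample, candidate))

print(matches)
```

Because nothing large is loaded, running one such comparison per file in a separate process does not multiply the memory footprint the way passing two DataFrames to every process does. If the files are not sorted yet, they would need to be sorted first (for example with the `sort` command) before either `comm` or this sketch gives correct results.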
