
Speeding up DASK bag processing of a text file?

Hi, I have the following code:

import dask.bag as db

class Similarity(object):

    def __init__(self):
        self.bag = None

    @staticmethod
    def olap(item, ary):
        # each line looks like "index:word word word ..."
        iix, item = item.strip().split(':')
        aix, ary = ary.strip().split(':')
        # overlap score: number of words the two lines have in common
        o = len(set(item.split(' ')) & set(ary.split(' ')))
        return o, int(aix), int(iix), item

    def process(self, file):
        self.bag = db.read_text(file)
        rv = []
        for c, i in enumerate(self.bag.take(100)):
            # compare line i against every line in the bag, drop the
            # self-comparison, and keep the best match
            rv.append(self.bag.map(Similarity.olap, i).filter(lambda x: x[1] != x[2]).max().compute())
        return rv

For now I'm processing a ~10,000-line text file, which is a light load. It is simply one sentence per line; each line is split into words and compared against all the other lines in the file.

The problem is that it is too SLOW... 100 steps take ~1 min 20 sec with all the CPUs working. At the same time, the score function itself is fast, ~2 microseconds:

In [171]: %timeit Similarity.olap('5:aaa bbb ccc dddd  ooooooooo ppppppppppp jee', '7:bbb aa ccc ddd ee uu oooo pppp')                                                       
2.09 µs ± 56.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So do you have any tricks to help me SPEED it up?

Comparing 10**4 lines to each other is 10**8 computations, so it's not a very light workload. If your similarity metric is symmetric (which seems to be the case here), you can halve the time by scoring each unordered pair only once: the similarity from A to B is sufficient to know the similarity from B to A. A sketch of this idea follows below.
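Here is a minimal sketch of that halving in plain Python (an illustration of the idea, not a drop-in replacement for the dask pipeline), assuming the lines still carry the "index:text" format that olap expects. Note that because each pair is scored only once, the result has to be credited to both lines:

# Sketch only: score each unordered pair once, then credit both lines.
from itertools import combinations

def best_matches(lines):
    best = {}  # line index -> (best overlap, index of best match)
    for a, b in combinations(lines, 2):         # each pair scored exactly once
        o, bix, aix, _ = Similarity.olap(a, b)  # olap returns (overlap, b's index, a's index, a's text)
        for x, y in ((aix, bix), (bix, aix)):   # symmetry: update both directions
            if x not in best or o > best[x][0]:
                best[x] = (o, y)
    return best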

In terms of the actual code, you have .compute inside a loop, which slows things down because each iteration must wait for the previous computation to complete. Rough pseudocode for a faster way is below (it will probably need adjusting for actual results):

# ensure this is in the imports along with the dask bag
import dask


# other code including Class definition skipped

    def process(self, file):
        self.bag = db.read_text(file)
        rv = []
        for i in self.bag.take(100):
            # build the lazy graph only; no compute inside the loop
            rv.append(self.bag.map(Similarity.olap, i).filter(lambda x: x[1] != x[2]).max())

        # rv contains lazy objects only, so compute them all at once;
        # dask.compute returns a one-element tuple holding the list of results
        (rv,) = dask.compute(rv)
        return rv
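Batching the lazy objects into a single dask.compute call lets dask merge everything into one graph, so work that is identical across the 100 pipelines (such as the read_text tasks) can be shared, and the per-line reductions can be scheduled in parallel instead of serially. A quick usage sketch (the file name is hypothetical):

sim = Similarity()
best = sim.process('sentences.txt')  # list of (overlap, aix, iix, text) tuples, one per reference line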
