How to use threading to improve performance in python

I have a list of around 500,000 sentences, and also a list of around 13,000,000 concepts. For each sentence I want to extract the concepts it contains, in the order they appear in the sentence, and write them to the output.

For example, my Python program looks as follows.

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

output = []
counting = 0

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)

The output is:

[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]

However, the order of the output is not important to me, i.e. my output could also look as follows (in other words, the lists inside output can be shuffled):

[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

However, due to the length of my sentences and concepts lists, this program is still quite slow.

Is it possible to further improve the performance (in terms of time) using multithreading in Python?

Whether multithreading will yield an actual performance increase does not just depend on the Python implementation and the amount of data; it also depends on the hardware executing the program. In some cases, where the hardware offers no advantage, multithreading may end up slowing things down due to increased overhead.

However, assuming you're running on a modern standard PC or better, you may see some improvement with multithreading. The problem then is to set up a number of workers, pass the work to them, and collect the results.

Staying close to your example structure, implementation and naming:

import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        # list.append is atomic under CPython's GIL, so sharing l_out
        # between worker threads is safe here
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()


# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)
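
For comparison, the same worker pattern can be sketched more compactly with concurrent.futures.ThreadPoolExecutor. This variant is not from the original answer, just a sketch reusing the find_all_concepts, sentences and num_threads defined above:

from concurrent.futures import ThreadPoolExecutor

# map() hands each sentence to a pool thread and collects the results in
# input order, replacing the manual queue and daemon-thread bookkeeping
with ThreadPoolExecutor(max_workers=num_threads) as executor:
    output = list(executor.map(find_all_concepts, sentences))
print(output)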

User @wwii asked about multiple threads not really helping with CPU-bound problems (in CPython, the global interpreter lock prevents threads from running Python code in parallel). Instead of using multiple threads accessing the same output variable, you could also use multiple processes accessing a shared output queue, like this:

import re
import queue
import multiprocessing

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, q_out):
    try:
        while True:
            # a short timeout guards against a worker finding the queue
            # empty before the main process has finished filling it
            sentence = q_in.get(timeout=1)
            q_out.put(find_all_concepts(sentence))
    except queue.Empty:
        pass


if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # any reasonable number of workers; each worker runs do_find_all_concepts
    # as its initializer and drains the input queue, so no tasks are ever
    # submitted to the pool itself
    num_processes = 2
    pool = multiprocessing.Pool(num_processes, do_find_all_concepts, (sentences_q, output_q))

    # put all the input on the queue
    for s in sentences:
        sentences_q.put(s)

    # wait for the workers' initializers to process the entire queue
    pool.close()
    pool.join()
    while not output_q.empty():
        print(output_q.get())

Still more overhead, but it uses the CPU resources available on the other cores as well.

Here are two solutions using concurrent.futures.ProcessPoolExecutor, which will distribute the tasks to different processes. Your task appears to be CPU bound, not I/O bound, so threads probably won't help.

import re
import concurrent.futures

# using the sentences and concepts lists from your example

re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':

    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        # as_completed yields futures in completion order, which is fine
        # here since the order of the output doesn't matter
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map, by contrast, returns results in input order
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

Executor.map() has a chunksize parameter: the docs say that sending chunks of more than one item of the iterable could be beneficial. I thought the function would need to be refactored to account for that, but when I tested it with a function that just returned what it was sent, the function only ever received single items, regardless of the chunksize I specified. That matches how map is documented to behave: chunksize only controls how many items are shipped to a worker process per IPC round trip; the function is still called once per item.

def h(sequence):
    return sequence

One drawback with multiprocessing is that the data must be serialized/pickled to be sent to the worker processes, which takes time and might be significant for a compiled regular expression that large - it might defeat the gains from multiple processes.

I made a set of 13e6 random strings with 20 characters each to approximate your compiled regex.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Pickling to an io.BytesIO stream takes about 7.5 seconds, and unpickling from an io.BytesIO stream takes 9 seconds. If using a multiprocessing solution, it may be beneficial to pickle the concepts object (in whatever form) to the hard drive once, then have each process unpickle it from the hard drive, rather than pickling/unpickling on each side of the IPC every time a new process is created. Definitely worth testing - YMMV. The pickled set is 380 MB on my hard drive.
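
A minimal sketch of that idea, assuming the small example lists from the question are defined at module level; the filename 'concepts.pkl' and the helper names init_worker/work are illustrative, not from the original answer. Each worker unpickles the data once in a Pool initializer, so it is never shipped over IPC per task:

import pickle
import re
import multiprocessing

def init_worker(path):
    # runs once per worker process: load the concepts from disk and build
    # the matcher there, instead of pickling it across the IPC boundary
    global find_all_concepts
    with open(path, 'rb') as fh:
        loaded = pickle.load(fh)
    find_all_concepts = re.compile('|'.join(map(re.escape, loaded)), flags=re.DOTALL).findall

def work(sentence):
    return find_all_concepts(sentence)

if __name__ == '__main__':
    # one-time cost in the parent: write the concepts to disk
    with open('concepts.pkl', 'wb') as fh:
        pickle.dump(concepts, fh, protocol=pickle.HIGHEST_PROTOCOL)

    with multiprocessing.Pool(2, init_worker, ('concepts.pkl',)) as pool:
        results = pool.map(work, sentences)
    print(results)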

When I tried some experiments with concurrent.futures.ProcessPoolExecutor, I kept blowing up my computer: each process needed its own copy of the set, and my computer just doesn't have enough RAM.

I'm going to post another answer dealing with the method of testing for concepts in sentences.

This answer will address improving performance without using concurrency.


The way you structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes for each sentence and that the word lengths in concepts range from one to ten.

I think you can improve the search time by making a set of the concepts (either initially, when it is constructed, or from your list), then splitting each sentence into strings of one to ten (consecutive) words and testing for membership in the set.

Example of a sentence split into 4-word strings:

'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
 ('mining', 'is', 'the', 'process'),
 ('is', 'the', 'process', 'of'),
 ('the', 'process', 'of', 'discovering'),
 ('process', 'of', 'discovering', 'patterns'),
 ('of', 'discovering', 'patterns', 'in'),
 ('discovering', 'patterns', 'in', 'large'),
 ('patterns', 'in', 'large', 'data'),
 ('in', 'large', 'data', 'sets'),
 ('large', 'data', 'sets', 'involving'),
 ('data', 'sets', 'involving', 'methods'),
 ('sets', 'involving', 'methods', 'at'),
 ('involving', 'methods', 'at', 'the'),
 ('methods', 'at', 'the', 'intersection'),
 ('at', 'the', 'intersection', 'of'),
 ('the', 'intersection', 'of', 'machine'),
 ('intersection', 'of', 'machine', 'learning'),
 ('of', 'machine', 'learning', 'statistics'),
 ('machine', 'learning', 'statistics', 'and'),
 ('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]

Process:

concepts = set(concepts)
words = sentence.split()
found = []
# one word
for meme in words:
    if meme in concepts:
        found.append(meme)
# two words
for meme in zip(words, words[1:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))
# three words
for meme in zip(words, words[1:], words[2:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))

Adapting an itertools recipe (pairwise), you can automate the process of making n-word strings from a sentence:

from itertools import tee
def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)
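
For example, on a short sequence nwise produces the sliding windows you'd expect (output shown in comments):

list(nwise(['a', 'b', 'c', 'd'], 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]
list(nwise(['a', 'b', 'c', 'd'], 3))  # [('a', 'b', 'c'), ('b', 'c', 'd')]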

Testing each sentence then looks like this:

words = sentence.strip().split()
found = []
for n in [1,2,3,4,5,6,7,8,9,10]:
    for meme in nwise(words, n):
        meme = ' '.join(meme)
        if meme in concepts:
            found.append(meme)

I made a set of 13e6 random strings with 20 characters each to approximate concepts:

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Testing a four- or forty-character string for membership in data consistently takes about 60 nanoseconds. A one-hundred-word sentence has 955 one-to-ten-word strings, so searching that sentence should take ~60 microseconds.
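
As a quick sanity check on that count (a w-word sentence yields w - n + 1 strings of n consecutive words; this snippet is just the arithmetic, not from the original answer):

# 100 + 99 + ... + 91 candidate strings of one to ten consecutive words
sum(100 - n + 1 for n in range(1, 11))  # 955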

The first sentence from your example, 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', has 195 possible concepts (one-to-ten-word strings). Timing for the following two functions is about the same: about 140 microseconds for f and 150 microseconds for g:

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence,n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1,11) for meme in nwise(sentence,n))
    found = test_strings.intersection(data)
    return found

So these are just approximations, since I'm not using your actual data, but it should speed things up quite a bit.

After testing with your example data I found that g won't work if a concept appears twice in a sentence: g collects its results as a set intersection, so duplicate occurrences are collapsed into one.
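
For instance, reusing the question's example lists (with concepts standing in for data, an illustrative substitution; results shown in comments):

# g returns a set intersection, so the duplicated 'information' collapses:
g(sentences[1], data=set(concepts))
# -> {'data mining', 'information', 'interdisciplinary subfield'}

# f appends each occurrence as it is found and keeps both:
f(sentences[1], data=set(concepts))
# -> ['information', 'information', 'data mining', 'interdisciplinary subfield']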


So here it is all together, with the concepts listed in the order they are found in each sentence. The new version of f will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it is than the original? (I'm curious.)

from itertools import tee

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

concepts = set(concepts)

def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:],1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

def f(sentence, concepts=concepts, nwise=nwise):
    '''iterate over memes in sentence and see if they are in concepts'''
    # (start, end) character positions of matches found so far, so repeated
    # occurrences of the same concept are each recorded separately
    indices = set()
    #print(sentence)
    words = sentence.strip().split()
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(words,n):
            meme = ' '.join(meme)
            if meme in concepts:
                start = sentence.find(meme)
                end = len(meme)+start
                # if this (start, end) was already recorded, look for the
                # next occurrence of the same meme further along the sentence
                while (start,end) in indices:
                    #print(f'{meme} already found at character:{start} - looking for another one...') 
                    start = sentence.find(meme, end)
                    end = len(meme)+start
                indices.add((start, end))
    # sorting the indices returns the concepts in sentence order
    return [sentence[start:end] for (start,end) in sorted(indices)]


###########
results = []
for sentence in sentences:
    results.append(f(sentence))
    #print(f'{sentence}\n\t{results[-1]})')


In [20]: results
Out[20]: 
[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'knowledge discovery', 'databases process', 'process']]
