
Getting the count of every word in a file using threads

I'm currently trying to use threads to count every word in a file in parallel, but my code gets slower when I add even one extra thread. I would expect the running time to decrease as the number of threads increases, until the CPU becomes the bottleneck and the times get slower again. I don't understand why it isn't running in parallel.

Here is the code:

import thread
import threading
import time
import sys
class CountWords(threading.Thread):
    def __init__(self,lock,tuple):
        threading.Thread.__init__(self)
        self.lock = lock
        self.list = tuple[1]
        self.dit = tuple[0]
    def run(self):
        for word in self.list:
            #self.lock.acquire()
            if word in self.dit.keys():
                self.dit[word] = self.dit[word] + 1
            else:
                self.dit[word] = 1
            #self.lock.release()


def getWordsFromFile(numThreads, fileName):
    lists = []
    for i in range(int(numThreads)):
        k = []
        lists.append(k)
    print len(lists)
    file = open(fileName, "r")  # uses .read().splitlines() instead of readLines() to get rid of "\n"s
    all_words = map(lambda l: l.split(" "), file.read().splitlines()) 
    all_words = make1d(all_words)
    cur = 0
    for word in all_words:
        lists[cur].append(word)
        if cur == len(lists) - 1:
            cur = 0
        else:
            cur = cur + 1
    return lists

def make1d(list):
    newList = []
    for x in list:
        newList += x
    return newList

def printDict(dit):# prints the dictionary nicely
    for key in sorted(dit.keys()):
        print key, ":", dit[key]  



if __name__=="__main__":
    print "Starting now"
    start = int(round(time.time() * 1000))
    lock=threading.Lock()
    ditList=[]
    threadList = []
    args = sys.argv
    numThreads = args[1]
    fileName = "" + args[2]
    for i in range(int(numThreads)):
        ditList.append({})
    wordLists = getWordsFromFile(numThreads, fileName)
    zipped = zip(ditList,wordLists)
    print "got words from file"
    for tuple in zipped:
        threadList.append(CountWords(lock,tuple))
    for t in threadList:
        t.start()
    for t in threadList:
        if t.isAlive():
            t.join()
    fin = int(round(time.time() * 1000)) - start
    print "with", numThreads, "threads", "counting the words took :", fin, "ms"
    #printDict(dit)

You can use itertools to count the words in a file. Below is a simple example; explore itertools.groupby and adapt the code to your own logic.

import itertools

tweets = ["I am a cat", "cat", "Who is a good cat"]

words = sorted(list(itertools.chain.from_iterable(x.split() for x in tweets)))
count = {k:len(list(v)) for k,v in itertools.groupby(words)}
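
For comparison, here is a minimal sketch of the same count using collections.Counter from the standard library. This is an alternative to the groupby approach above, not part of it, and it assumes Python 3. Counter tallies the words in one pass, so the list does not need to be sorted first.

```python
from collections import Counter

tweets = ["I am a cat", "cat", "Who is a good cat"]

# Counter tallies hashable items in a single pass; no sorting is required.
count = Counter(word for line in tweets for word in line.split())

print(count["cat"])  # 3
print(count["a"])    # 2
```

To use this on a file, the lines of the file can take the place of the tweets list.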

Python (specifically CPython, the standard interpreter) cannot run threads in parallel across multiple cores because of the GIL (see "What is a global interpreter lock (GIL)?"). Only one thread executes Python bytecode at a time.

Adding threads to this task only increases the overhead of your code, making it slower.

I can think of two situations where threads are useful:

  • When you have a lot of I/O: threads let your code run concurrently (not in parallel, see https://blog.golang.org/concurrency-is-not-parallelism ), so it can do other work while waiting for a response, which can give a good speedup.
  • When you don't want a huge computation to block the rest of your code: you use a thread to run that computation concurrently with other tasks.
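
The I/O case can be sketched as follows, assuming Python 3. The function fake_download is a hypothetical stand-in for a real network or disk request; it only sleeps, which releases the GIL in the same way blocking I/O does. The names, the delay, and the worker count are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(name, delay=0.2):
    """Hypothetical stand-in for a network or disk request."""
    time.sleep(delay)  # sleeping releases the GIL, as blocking I/O does
    return name.upper()


def download_all(names, workers=4):
    # Executor.map returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_download, names))


if __name__ == "__main__":
    start = time.perf_counter()
    results = download_all(["a", "b", "c", "d"])
    elapsed = time.perf_counter() - start
    print(results, "in %.2f s" % elapsed)
```

Because the threads spend their time waiting rather than computing, the waits overlap and the total time should be close to one delay rather than four.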

If you want to leverage all your cores, you need to use the multiprocessing module ( https://docs.python.org/3.6/library/multiprocessing.html ).
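
A minimal sketch of how the word count might look with multiprocessing, assuming Python 3. The helper names (split_into_chunks, count_chunk, merge_counts, parallel_word_count) are made up for illustration. The words are dealt round-robin into one list per process, as in the question's code, each process counts its own list, and the partial counts are merged at the end.

```python
from collections import Counter
from multiprocessing import Pool


def split_into_chunks(words, n):
    """Deal words round-robin into n lists, as the question's code does."""
    return [words[i::n] for i in range(n)]


def count_chunk(words):
    """Count the words of one chunk; runs inside a worker process."""
    return Counter(words)


def merge_counts(partials):
    """Combine the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_word_count(words, processes=2):
    chunks = split_into_chunks(words, processes)
    with Pool(processes=processes) as pool:
        partials = pool.map(count_chunk, chunks)
    return merge_counts(partials)


if __name__ == "__main__":
    text = "the cat saw the other cat"
    print(parallel_word_count(text.split(), processes=2))
```

Starting processes and sending the word lists to them has a cost of its own, so for small files this may still be slower than a single-process count; the benefit is expected only when the input is large enough.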
