
How to implement multiprocessing in my python script?

What I'm trying to do is have multiple cores process some work in parallel.

I have a Python script which builds a feature vector for my ML and DL models. It takes a txt file and compares it with a list I initialize at runtime: the code takes each line in the txt, finds where that line sits in the unique list, and does +1 at that position in another list. That list is then appended to yet another list, which is written to a csv file at the end of the execution. Without threading, I can process 1 file in around 500 ms to 1.5 sec, and the activity monitor on my system shows one core pinned at 100% at any given time. But when I run my multithreaded version, all 10 cores sit at only around 10% to 20%, and the execution isn't noticeably faster than the non-threaded version.

I have 6464 files, and my system is either an i5 8300H (4 cores) or an i9 10900 (10 cores).

In find_files(), the script finds the files I need. Some of the variables are initialized elsewhere; I left that out here to keep the code from getting complicated. At the end you can see a thread helper function, which is used to divide the calls across multiple threads.

def find_files():
    internal_count = 0

    output_path = script_path + "/output"
    fam_dirs = os.listdir(output_path)
    for fam_dir in fam_dirs:
        if fam_dir in malware_families:
            files = os.listdir(output_path + "/" + fam_dir + "/")
            for file in files:
                file_path = output_path + "/" + fam_dir + "/" + file
                internal_count += 1
                thread_helper(file_path, file, fam_dir, internal_count)

Here you can see the thread helper takes the 'count' arg and divides the calls among specific threads to even out the load. I've tried thread.join() too, but same issue.

def thread_helper(file_path, file, fam_dir, count):
    if count % 4 == 0:
        t4 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t4.start()
    elif count % 3 == 0:
        t3 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t3.start()
    elif count % 2 == 0:
        t2 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t2.start()
    else:
        t1 = threading.Thread(target=generate_feature_vector, args=(file_path, file, fam_dir))
        t1.start()

    # wait until threads are completely executed
    # t1.join()
    # t2.join()
    # t3.join()
    # t4.join()

So in this function I take a file, open it, read the input lines one by one, find where each line is in the unique list, and +1 the value in the feature vector list at the position corresponding to the unique list.

def generate_feature_vector(path, file_name, family):
    try:
        with open(path, 'r') as file:
            lines = file.read().split("\n")

            feature_vector = list()

            # Init feature_vector to 0 with the length of the unique list of that feature
            for i in range(len(feature_ls)):
                feature_vector.append([0] * len(unique_ls[i]))

            # +1 the value in the cell corresponding to the file_row and
            # feature_name_column, and append the family name
            for line in lines:
                for i in range(len(feature_ls)):
                    if line in unique_ls[i]:
                        feature_vector[i][unique_ls[i].index(line)] += 1
                    feature_vector[i][len(feature_vector[i]) - 1] = family
    except IOError as e:
        print(e)

I'm not sure where I messed up. So can you guys help me divide the load across multiple CPU cores? Also, please do correct me if I messed up any terms while asking this question.

Because of Python's GIL, only one thread executes Python bytecode at a time, which makes threading effective only for I/O-bound applications, not CPU-bound ones like yours.

To parallelize your program, I recommend the multiprocessing module, which is quite similar to threading in its use but implements true parallelism.
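For illustration, here is a minimal sketch of the pattern. The function work is just a placeholder standing in for your CPU-bound per-file processing; Pool.map fans the inputs out across one worker process per core:

```python
from multiprocessing import Pool

def work(n):
    # stand-in for a CPU-bound task (e.g. building one feature vector)
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [10_000, 20_000, 30_000]   # e.g. one entry per file
    with Pool() as pool:                # defaults to os.cpu_count() processes
        results = pool.map(work, inputs)
    print(results)
```

Note that the `if __name__ == "__main__":` guard is required on platforms that spawn child processes by re-importing the main module (Windows, and macOS by default).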

So as said by Joao Donasolo, I tried multiprocessing. This whole time, I was just looking in the wrong place: I was mistaking multiprocessing for multithreading. For my specific issue, the multiprocessing module helped. I haven't run a benchmark to compare these yet, but as soon as I'm able to, I'll update the stats and code here.
