
Running a python script on 16 CPUs instead of 1 CPU

I have a bash script that launches a python script:

#!/bin/bash
#SBATCH -J XXXXX
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

python my_python_script.py

The python script scans a very large file (~480,000,000 rows) and builds a dictionary that is later written out to the output file:

import csv

with open(huge_file, 'r') as hugefile, open(final_file, 'w') as final:
    reader = csv.reader(hugefile, delimiter="\t")
    writer = csv.writer(final, delimiter="\t")

    d = {}

    for r in reader:
        v = r[0] + r[1]
        if v not in d:
            d[v] = [r[5], r[4]]
        else:
            d[v].append([r[5], r[4]])

    for k, v in d.items():
        # analyses
        nl = [different variables]
        writer.writerow(nl)

Due to the size of the file, I want to use 16 CPUs for the run, yet even though I defined 16 CPUs in my bash script, it only uses 1 CPU.

I have read a lot about subprocess, yet it does not seem to apply in this case. I would love to hear any suggestions.

I would suggest using a multiprocessing Pool to fill the dict.

from multiprocessing import Pool
import csv

def func(r):
    # return the key and the value pair; worker processes cannot
    # update a dict that lives in the parent process directly
    return r[0] + r[1], [r[5], r[4]]

if __name__ == '__main__':
    d = {}
    with open(huge_file, 'r') as hugefile, Pool(16) as p:
        reader = csv.reader(hugefile, delimiter="\t")
        for v, entry in p.imap(func, reader, chunksize=10000):
            if v not in d:
                d[v] = entry
            else:
                d[v].append(entry)

Similarly, the analysis can be done by applying the Pool to the items of dict d together with an analysis function.
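A minimal sketch of that analysis step, assuming a hypothetical analyze(item) function in place of the real per-key analysis and the csv writer from the question still being open:

# hypothetical sketch: analyze() stands in for the real per-key analysis
def analyze(item):
    k, values = item
    return [k, len(values)]          # placeholder for the "different variables"

with Pool(16) as p:
    for nl in p.imap(analyze, d.items(), chunksize=10000):
        writer.writerow(nl)          # write in the parent process only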

Cores won't help you here, as the dictionary manipulation is trivial and extremely fast.

You have an I/O issue here, where reading and writing the files is the bottleneck.

If you use the multiprocessing module, you may run into other issues. The dictionaries that get built will be independent of each other, so the same key can appear in several of them, each holding a different part of the data. If the ordering of the CSV data must be kept, maybe because it is timeseries data, you will have to merge and then sort the arrays in the dictionary as an additional step, unless you take this problem into account while merging the 16 dictionaries. This also means that you will be breaking the CSV up into 16 chunks and processing them individually on each core, so that you can keep track of the ordering.
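A rough sketch of that merge step, assuming a hypothetical list partial_dicts holding the 16 per-chunk dictionaries in the same order as the chunks appear in the file:

from collections import defaultdict

# sketch only: partial_dicts is an assumption, not part of the original code
merged = defaultdict(list)
for partial in partial_dicts:        # chunks in original file order
    for key, rows in partial.items():
        merged[key].extend(rows)     # rows from earlier chunks stay in front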

Have you considered reading the huge CSV file into a SQLite database? This would at least give you more control over how the data is accessed, since 16 processes could access the data at the same time while specifying the ordering.
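A rough sketch of that import step, using sqlite3 from the standard library; the database file name and table layout are assumptions, and the column indices mirror the ones used in the question:

import csv
import sqlite3

# sketch only: 'huge.db' and the table layout are assumptions
conn = sqlite3.connect('huge.db')
conn.execute('CREATE TABLE IF NOT EXISTS rows (k TEXT, c4 TEXT, c5 TEXT)')
with open(huge_file, 'r') as hugefile:
    reader = csv.reader(hugefile, delimiter='\t')
    conn.executemany('INSERT INTO rows VALUES (?, ?, ?)',
                     ((r[0] + r[1], r[4], r[5]) for r in reader))
conn.commit()
conn.close()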

I really doubt that there is anything to parallelize here. Even if you use the multiprocessing module, you need to write the entire file while taking the entire dictionary into account, which restricts your options for parallelizing this task.

Multiprocessing is difficult to apply here because everything needs to be collected into one central dict d. Several processes would constantly have to know which keys are already in the dict, and that makes it really complex. So the easier solution is to try to speed up the processing while staying within one process. Dict and list comprehensions seem to be a good way forward:

import csv

with open(huge_file, 'r') as hugefile:
    rows = list(csv.reader(hugefile, delimiter="\t"))  # a csv reader can only be consumed once

# prepare dict keys and empty list entries:
d = {r[0]+r[1]: [] for r in rows}

# fill dict
[d[r[0]+r[1]].append([r[5], r[4]]) for r in rows]

# d is ready for analysis

You can use threading to do this. First, move the code into a function that does something like this:

import threading

def your_func_name(results, rows, index):
    sliced_d = {}
    for r in rows:
        v = r[0] + r[1]
        if v not in sliced_d:
            sliced_d[v] = [r[5], r[4]]
        else:
            sliced_d[v].append([r[5], r[4]])

    for k, v in sliced_d.items():
        # analyses
        nl = [different variables]
        writer.writerow(nl)
    results[index] = sliced_d


Now, define how many CPUs you want to use, slice the rows accordingly, and send the slices into the threads.

d = {}
cpus = 16
rows = list(reader)                 # materialize the reader so it can be sliced
chunk = len(rows) // cpus
results = [None] * cpus
threads = []

for i in range(cpus):
    if i < cpus - 1:
        t = threading.Thread(target=your_func_name,
                             args=(results, rows[i*chunk:(i+1)*chunk], i))
    else:
        t = threading.Thread(target=your_func_name,
                             args=(results, rows[i*chunk:], i))
    t.start()
    threads.append(t)

for t in threads:
    t.join()

for i in range(cpus):
    d.update(results[i])

Note: the code may still have some bugs, as it is only meant as an example.

Here is an idea for how to use multiple processes (not yet tested with a large file, but I am confident it will work once debugged).

Step 1 is to split the huge file into segments using the Linux split command:

 bash> split -l 10000000 hugefile segment

This will create files of 10,000,000 lines each, named segmentaa, segmentab, ... (see the man page of split).
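If the number of segments is not known in advance, the segment names could also be collected with glob instead of being listed by hand, as a small sketch that assumes the segments sit in the current working directory:

import glob

# sketch only: picks up segmentaa, segmentab, ... written by split
segment_names = sorted(glob.glob('segment*'))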

Now the Python program reads these file segments, launches one process per file segment, and then consolidates the results into one dict:

import multiprocessing as mp
import csv


# define process function working on a file segment
def proc_target(q, filename):
    with open(filename, 'r') as file_segment:
        reader = csv.reader(file_segment, delimiter="\t")

        dd = dict()

        def func(r):
            key = r[0] + r[1]
            if key in dd:
                dd[key].append([r[5], r[4]])
            else:
                dd[key] = [r[5], r[4]]

        [func(r) for r in reader]

        # send result via queue to main process
        q.put(dd)


if __name__ == '__main__':
    segment_names = ['segmentaa', 'segmentab', 'segmentac']  # maybe there are more file segments ...
    processes = dict()  # all objects needed are stored in this dict
    mp.set_start_method('spawn')

    # launch processes
    for fn in segment_names:
        
        processes[fn] = dict()

        q = mp.Queue()
        p = mp.Process(target=proc_target, args=(q, fn))
        p.start()

        processes[fn]["process"] = p
        processes[fn]["queue"] = q

    # read results
    for fn in segment_names:
        processes[fn]["result"] = processes[fn]["queue"].get()
        processes[fn]["process"].join()

    # consolidate all results
    # start with first segment result and merge the others into it
    d = processes[segment_names[0]]["result"]

    # helper function for fast execution using list comprehension
    def consolidate(key, value):
        if key in d:
            d[key].append(value)
        else:
            d[key] = value

    # merge other results into d
    for fn in segment_names[1:]:
        [consolidate(key, value) for key, value in processes[fn]["result"].items()]

    # d is ready

In order to avoid I/O bottlenecks, it might be wise to distribute the segments over several disks and let the parallel processes access different I/O resources in parallel.
