
multiprocessing.Pool and slurm

I wrote a simple function that calculates the number of lines in a file.

def line_count(file_name):
    temp = 0
    with open(file_name, 'r') as ref:
        ref.readline()  # skip the first line (presumably a header)
        for line in ref:
            temp += 1
    return temp, file_name

Then I use Pool to apply it to each file I have in a folder:

import glob
from multiprocessing import Pool
import numpy as np

files = []
for f in glob.glob(data_directory + '/*.txt'):
    files.append(f)

with Pool(np.min([tot_process, len(files)])) as pool:
    rt = pool.map(line_count, files)

where tot_process is an argument I pass from a Slurm script that runs the Python code (see the sketch after the header below). In the Slurm script, I have the following header:

#!/bin/bash
#SBATCH --cpus-per-task=60
#SBATCH --partition=normal
#SBATCH --ntasks=1
#SBATCH --mem=0
#SBATCH --job-name=test
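
A minimal sketch of how tot_process might reach the Python code, assuming the sbatch script ends with a line like python count_lines.py "$SLURM_CPUS_PER_TASK" (count_lines.py is a hypothetical name; with --cpus-per-task=60 above, Slurm sets SLURM_CPUS_PER_TASK to 60 inside the job):

import sys

# Assumption: the sbatch script launches this file as
#   python count_lines.py "$SLURM_CPUS_PER_TASK"
tot_process = int(sys.argv[1])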

Since the function takes about 4 seconds on a 40-million-line file, I would expect that running the code with tot_process=60 on a folder of 60 files would take approximately the same amount of time. Instead, it takes around 240 seconds. I suspect I am missing something very basic here and my script is not running multiprocessing as it should.

You are I/O bound. Notice that 240 seconds is exactly 60 files times 4 seconds each, which means the reads are effectively serialized at the disk. No matter how many processes you create, the hard drive itself can only read the files so fast. It must seek to a file block, read at the speed of the spinning disk, and repeat for as many blocks as are in the file. The disk is also attached to the computer via some sort of bus, likely SATA or SCSI, which has its own speed limitations (though likely much faster than the disk itself). An SSD is faster than a spinning disk, but the general rule still applies.

40 million lines suggests files in the range of a gigabyte. At 4 seconds per file, you are getting roughly a 250 MB/s transfer rate, which is quite fast (assuming a standard locally attached hard drive).
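
One way to verify this is to time a raw sequential read with no line counting at all; a minimal sketch (read_throughput is a hypothetical helper, and repeat runs on the same file may be inflated by the OS page cache):

import time

def read_throughput(file_name, block_size=2**20):
    """Read file_name sequentially in 1 MiB blocks; return MB/s."""
    total = 0
    start = time.perf_counter()
    with open(file_name, 'rb') as ref:
        while True:
            buf = ref.read(block_size)
            if not buf:
                break
            total += len(buf)
    return total / (time.perf_counter() - start) / 1e6

If this alone reports around 250 MB/s for one of your files, the disk, not Python or the pool, is the limit.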

You may get some speedup by skipping multiprocessing completely and switching to a memory-mapped file. Here is an example that reads in fairly large blocks with direct I/O. I haven't profiled this, so it's just a guess.

import os
import mmap

LINE_COUNT_BLOCKSIZE = 2**25   # 32 MiB per read

def line_count(filename):
    # O_DIRECT is a Linux-specific flag that hints the kernel to bypass the page cache.
    fd = os.open(filename, os.O_RDONLY | os.O_DIRECT)
    try:
        # Map the whole file read-only and count newlines chunk by chunk.
        with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            while True:
                buf = mm.read(LINE_COUNT_BLOCKSIZE)
                if not buf:
                    break
                count += buf.count(b"\n")
            return count
    finally:
        os.close(fd)
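
Since the pool is gone, a plain loop is enough; a usage sketch reusing the files list from the question (note this version returns only the count, and counts every line, including the first one the original skipped):

for f in files:
    print(f, line_count(f))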
