So I've been playing around with the multiprocessing
module, trying to figure out ways to speed up a lot of the work I do with pandas
DataFrames.
The example I was working with takes a sequence of Excel files, each one representing a year's worth of data, turns each into a DataFrame and then sums one of the columns. Sequentially, something like this:
import time
import pandas as pd

now = time.time()
sums = {}
table_2010 = pd.read_excel('2010.xlsx')
table_2011 = pd.read_excel('2011.xlsx')
table_2012 = pd.read_excel('2012.xlsx')
table_2013 = pd.read_excel('2013.xlsx')
table_2014 = pd.read_excel('2014.xlsx')
table_2015 = pd.read_excel('2015.xlsx')
sums[2010] = table_2010[[95]].sum()
sums[2011] = table_2011[[95]].sum()
sums[2012] = table_2012[[95]].sum()
sums[2013] = table_2013[[95]].sum()
sums[2014] = table_2014[[95]].sum()
sums[2015] = table_2015[[95]].sum()
print(sums)
print(time.time() - now)
This took 205 seconds; the Excel files are sizable and take a while to load into a DataFrame, so I assumed that running the loads in parallel would improve performance. What I came up with was this:
import time
import pandas as pd
from multiprocessing.pool import ThreadPool

def func(year):
    table = pd.read_excel(str(year) + '.xlsx')
    sums[year] = table[[95]].sum()

if __name__ == '__main__':
    now = time.time()
    sums = {}
    pool = ThreadPool(8)
    pool.map_async(func, [2010, 2011, 2012, 2013, 2014, 2015])
    pool.close()
    pool.join()
    print(sums)
    print(time.time() - now)
When I ran this, though, it ended up taking 250 seconds. It was my impression that having separate cores run each of these tasks would improve performance; is that incorrect?
Or is there an issue with the script I wrote?
Slower?
Depends.
Depends, a lot.
Is there an issue with the script?
Yes, a severe one ( still no need to worry or panic - a well solvable one ). Enjoy the read.
# ==========================================================================
# an-<iterator>-based SERIAL processing of 9 CPU-bound tasks took 1290.538 [sec]
# aThreadPool(6)-based TPOOL processing of 9 CPU-bound tasks took 1212.065 [sec]
# aPool(6)-based POOL  processing of 9 CPU-bound tasks took  271.765 [sec]
# ==========================================================================
multiprocessing has several kinds of Pool-s. Based on the not fully documented MCVE above ( missing all the explicit namespace import-s needed to safely disambiguate the setup for the intended use-case ), let's start with your code, which uses ThreadPool.map_async() for processing many Excel files.
One could hardly start with a worse approach for the intended fast processing.
A Pool slower than SEQ? Having purposely borrowed the natively-parallel occam-language syntax, the question tends to lead to the pain of the PAR | SEQ dilemma in designing high-performance systems ( yes, HPC, sure - guess who would ever like to design slow systems on purpose, right? ).
The issue is sufficiently multi-fold to warrant asking even more questions before being able to seriously approach the answer to the initial dilemma.
What resources do we have?
What type of operations are to be executed in a PAR | SEQ arrangement:
is the problem purely { CPU-bound | IO-bound }?
does the problem need to share { state | data } during processing?
does the problem need to communicate { signals | messages } during processing?
A CPU-bound processing is much simpler to mock up ( and a way "greener" in terms of wear & tear on precious physical HPC resources ), so let's start with a primitive function:
def aFatCALCULUS( id ):            # an INTENSIVE CPU-bound WORKLOAD
    import numpy as np
    import os
    pass;    aST = "aFatCALCULUS( {1:>3d} ) [PID:: {0:d}] RET'd {2:d}"
    return(  aST.format( os.getpid(),
                         id,
                         id + len( str( [ np.math.factorial( 2**f )
                                          for f in range( 20 ) ][-1] ) )
                         )
             )
Now, let's execute this one several times, in different arrangements.
Forgive my non-PEP-8 formatting ( we are not sponsoring any core refactoring with the demonstrations made here, so nobody serious should feel this choice inappropriate in any sense ).
from multiprocessing.pool import ThreadPool                          # ThreadPool-mode
from multiprocessing      import Pool                                # Pool-mode
pass;                        import time

print( "{0:}----------------------------------------------------------- # SETUP:".format( time.ctime() ) )
aListOfTaskIdNUMBERs = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, ]

print( "{0:}----------------------------------------------------------- # SERIAL mode of EXECUTION:".format( time.ctime() ) )
start = time.clock_gettime( time.CLOCK_MONOTONIC_RAW )
pass;    [ aFatCALCULUS( id ) for id in aListOfTaskIdNUMBERs ]       # SERIAL <iterator>-driven mode of EXECUTION
pass;    duration = time.clock_gettime( time.CLOCK_MONOTONIC_RAW ) - start
print( "an-<iterator>-based SERIAL processing of {1:}-tasks took {0:} [sec]".format( duration, len( aListOfTaskIdNUMBERs ) ) )

print( "{0:}----------------------------------------------------------- # PROCESSING-ThreadPool mode of EXECUTION:".format( time.ctime() ) )
aTPool = ThreadPool( 6 )                                             # PROCESSING-ThreadPool.capacity == 6
start = time.clock_gettime( time.CLOCK_MONOTONIC_RAW )
pass;    aTPool.map( aFatCALCULUS, aListOfTaskIdNUMBERs )            # PROCESSING-ThreadPool-driven mode of EXECUTION
pass;    duration = time.clock_gettime( time.CLOCK_MONOTONIC_RAW ) - start
print( "aThreadPool(6)-based TPOOL processing of {1:}-tasks took {0:} [sec]".format( duration, len( aListOfTaskIdNUMBERs ) ) )

print( "{0:}----------------------------------------------------------- # PROCESSING-Pool mode of EXECUTION:".format( time.ctime() ) )
aPool = Pool( 6 )                                                    # PROCESSING-Pool.capacity == 6
start = time.clock_gettime( time.CLOCK_MONOTONIC_RAW )
pass;    aPool.map( aFatCALCULUS, aListOfTaskIdNUMBERs )             # PROCESSING-Pool-driven mode of EXECUTION
pass;    duration = time.clock_gettime( time.CLOCK_MONOTONIC_RAW ) - start
print( "aPool(6)-based POOL processing of {1:}-tasks took {0:} [sec]".format( duration, len( aListOfTaskIdNUMBERs ) ) )

print( "{0:}----------------------------------------------------------- # END.".format( time.ctime() ) )
The PID-numbers tell the story - aPool(6).map() distributed the 9 tasks among 6 separate processes:
["aFatCALCULUS( 1 ) [PID:: 898] RET'd 2771011",
"aFatCALCULUS( 2 ) [PID:: 899] RET'd 2771012",
"aFatCALCULUS( 3 ) [PID:: 900] RET'd 2771013",
"aFatCALCULUS( 4 ) [PID:: 901] RET'd 2771014",
"aFatCALCULUS( 5 ) [PID:: 902] RET'd 2771015",
"aFatCALCULUS( 6 ) [PID:: 903] RET'd 2771016",
"aFatCALCULUS( 7 ) [PID:: 898] RET'd 2771017",
"aFatCALCULUS( 8 ) [PID:: 899] RET'd 2771018",
"aFatCALCULUS( 9 ) [PID:: 903] RET'd 2771019"
]
whereas aThreadPool(6) executed all 9 tasks inside one and the same process:
["aFatCALCULUS( 1 ) [PID:: 16125] RET'd 2771011",
"aFatCALCULUS( 2 ) [PID:: 16125] RET'd 2771012",
"aFatCALCULUS( 3 ) [PID:: 16125] RET'd 2771013",
"aFatCALCULUS( 4 ) [PID:: 16125] RET'd 2771014",
"aFatCALCULUS( 5 ) [PID:: 16125] RET'd 2771015",
"aFatCALCULUS( 6 ) [PID:: 16125] RET'd 2771016",
"aFatCALCULUS( 7 ) [PID:: 16125] RET'd 2771017",
"aFatCALCULUS( 8 ) [PID:: 16125] RET'd 2771018",
"aFatCALCULUS( 9 ) [PID:: 16125] RET'd 2771019"
]
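The single shared PID also hints at a second trap waiting in the original script: once the work moves into separate Pool() processes, an assignment like sums[year] = ... happens only in the child's copy of the namespace and never reaches the parent. The remedy is to return the value and let .map() carry the results back; a minimal sketch, with a placeholder payload standing in for the real pd.read_excel() load:

```python
from multiprocessing import Pool

def year_total( year ):
    # placeholder for: pd.read_excel( str( year ) + '.xlsx' )[[95]].sum()
    # the point is the pattern: RETURN the result, never mutate a global
    return ( year, year * 2 )

if __name__ == '__main__':
    with Pool( 6 ) as pool:
        totals = dict( pool.map( year_total,
                                 [ 2010, 2011, 2012, 2013, 2014, 2015 ] ) )
    print( totals )                              # the parent now owns the results
```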
"A supercomputer turns compute-bound problems into I/O-bound problems." ( Seymour Cray )
Obey Seymour Cray's wisdom with all due humility, but do not let others make you the one who pays the costs of their missing HPC duties on your side of the CPU-budget.
IMHO, if this were my HPC-task, I would:
- avoid paying the pandas XLSX-import / transformation costs altogether
- make the Excel-data owner / processor guarantee and enforce an automated column-SUM() -{ auto | manual | scripted }- update on each data-element change/update arriving in time, be it in batch or by an event, on their data-store side
- go for the fastest ( distributed-processing ) architecture, with the powers of independent multiprocessing.Pool().map() processes not reading ( == moving all the heaps of data ) but using smart, direct access to just the cells ( == elements ) you need to process.
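If the XLSX files must be read at all, one cheap approximation of that direct-access advice: pandas read_excel() accepts a usecols parameter, so each Pool() worker can fetch just the single column it is about to sum, instead of importing the whole sheet ( note: usecols=[95] selects by position, while the original table[[95]] selected by label - adjust to your headers ). A sketch under those assumptions:

```python
import os
import pandas as pd
from multiprocessing import Pool

def column_sum( year, col=95 ):
    # read just the one needed column, not all the heaps of data
    series = pd.read_excel( str( year ) + '.xlsx', usecols=[ col ] ).iloc[ :, 0 ]
    return ( year, series.sum() )

if __name__ == '__main__':
    years = [ y for y in range( 2010, 2016 )             # process only the files
              if os.path.exists( str( y ) + '.xlsx' ) ]  # actually present
    with Pool( 6 ) as pool:
        sums = dict( pool.map( column_sum, years ) )
    print( sums )
```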
A PAR-arranged Pool() is faster than any other SEQ processing:
''' REAL SYSTEM:: multiprocessing.Pool(6).map()
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
top - 22:24:42 up 84 days, 23:05, 4 users, load average: 4.80, 2.17, 0.86
Threads: 366 total, 5 running, 361 sleeping, 0 stopped, 0 zombie
%Cpu0 : 75.7/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1 : 0.1/0.0 0[ ]
%Cpu2 : 0.0/0.0 0[ ]
%Cpu3 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu4 : 0.1/0.0 0[ ]
%Cpu5 : 0.0/0.0 0[ ]
%Cpu6 : 76.2/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu7 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu8 : 0.0/0.0 0[ ]
%Cpu9 : 75.5/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu10 : 0.5/0.4 1[ ]
%Cpu11 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu12 : 0.0/0.0 0[ ]
%Cpu13 : 0.0/0.0 0[ ]
%Cpu14 : 0.0/0.0 0[ ]
%Cpu15 : 0.7/0.5 1[|| ]
%Cpu16 : 0.0/0.0 0[ ]
%Cpu17 : 0.0/0.0 0[ ]
%Cpu18 : 0.0/0.0 0[ ]
%Cpu19 : 0.0/0.0 0[ ]
KiB Mem : 24522940 total, 22070528 free, 778080 used, 1674332 buff/cache
KiB Swap: 8257532 total, 7419136 free, 838396 used. 22905264 avail Mem
P S %CPU PPID PID nTH TIME+ USER PR NI RES CODE SHR DATA %MEM VIRT vMj vMn SWAP nsIPC COMMAND
1 S 0.0 1614 1670 1 10:54.15 root 20 0 632 740 416 1396 0.0 52140 0 0 1172 - `- haproxy
2 S 0.0 1614 1671 1 35:40.50 root 20 0 664 740 380 1528 0.0 52272 0 0 1172 - `- haproxy
19 S 0.0 1 1658 1 6:20.42 root 20 0 22960 468 14380 14240 0.1 466344 0 0 836 - `- httpd
12 S 0.0 1 24217 1 4:31.41 root 20 0 3984 8 668 7320 0.0 155304 0 0 3864 - `- munin-node
0 R 0.0 12882 4964 1 0:31.53 m 20 0 2596 96 1524 1596 0.0 158096 0 0 0 4026531839 `- top
0 S 0.1 15213 22779 22 0:11.16 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
1 S 0.1 15213 23613 22 0:10.83 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
7 R 99.9 16125 898 1 2:29.72 m 20 0 52084 2268 1336 1969112 0.2 2195352 0 3k 0 4026531839 `- python3
11 R 99.9 16125 899 1 2:29.72 m 20 0 52088 2268 1336 1969116 0.2 2195356 0 3k 0 4026531839 `- python3
6 S 76.3 16125 900 1 2:15.49 m 20 0 49724 2268 1236 1965520 0.2 2191760 0 777 0 4026531839 `- python3
0 S 75.7 16125 901 1 2:15.12 m 20 0 49732 2268 1236 1965524 0.2 2191764 0 775 0 4026531839 `- python3
9 S 75.6 16125 902 1 2:15.05 m 20 0 49732 2268 1236 1965524 0.2 2191764 0 775 0 4026531839 `- python3
3 R 99.9 16125 903 1 2:29.70 m 20 0 52100 2268 1336 1969120 0.2 2195360 0 3k 0 4026531839 `- python3
4 S 0.1 15213 904 22 0:00.36 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
15 S 1.2 19285 21279 2 21:27.31 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
8 S 0.0 19285 21281 2 0:14.88 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
10 S 0.9 22118 22120 2 20:07.34 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
4 S 0.0 22118 22122 2 0:19.39 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
4 S 0.0 2 29 1 33:46.57 root 20 0 0 0 0 0 0.0 0 0 0 0 - `- rcu_sched
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
top - 22:25:31 up 84 days, 23:06, 4 users, load average: 3.78, 2.30, 0.97
Threads: 365 total, 4 running, 361 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.2/0.4 1[ ]
%Cpu1 : 0.0/0.0 0[ ]
%Cpu2 : 0.0/0.0 0[ ]
%Cpu3 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu4 : 0.2/0.0 0[ ]
%Cpu5 : 0.0/0.0 0[ ]
%Cpu6 : 0.0/0.0 0[ ]
%Cpu7 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu8 : 0.0/0.0 0[ ]
%Cpu9 : 0.0/0.0 0[ ]
%Cpu10 : 0.6/0.4 1[| ]
%Cpu11 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu12 : 0.0/0.0 0[ ]
%Cpu13 : 0.0/0.0 0[ ]
%Cpu14 : 0.0/0.0 0[ ]
%Cpu15 : 0.6/0.6 1[|| ]
%Cpu16 : 0.0/0.0 0[ ]
%Cpu17 : 0.0/0.0 0[ ]
%Cpu18 : 0.0/0.0 0[ ]
%Cpu19 : 0.0/0.0 0[ ]
KiB Mem : 24522940 total, 22076660 free, 772436 used, 1673844 buff/cache
KiB Swap: 8257532 total, 7419136 free, 838396 used. 22911364 avail Mem
P S %CPU PPID PID nTH TIME+ USER PR NI RES CODE SHR DATA %MEM VIRT vMj vMn SWAP nsIPC COMMAND
2 S 0.2 1614 1671 1 35:40.51 root 20 0 664 740 380 1528 0.0 52272 0 0 1172 - `- haproxy
0 R 0.4 12882 4964 1 0:31.66 m 20 0 2596 96 1524 1596 0.0 158096 0 0 0 4026531839 `- top
7 R 99.9 16125 898 1 3:18.45 m 20 0 52608 2268 1336 1969112 0.2 2195352 0 9 0 4026531839 `- python3
11 R 99.9 16125 899 1 3:18.46 m 20 0 52612 2268 1336 1969116 0.2 2195356 0 9 0 4026531839 `- python3
3 R 99.9 16125 903 1 3:18.43 m 20 0 52624 2268 1336 1969120 0.2 2195360 0 10 0 4026531839 `- python3
15 S 1.2 19285 21279 2 21:27.92 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
10 S 1.0 22118 22120 2 20:07.81 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
'''