I want to process data ( stored in a disk file and later loaded into a numpy.array instance ) with rows like:
1 3 a
1 4 b
1 5 a
2 6 b
where the first column is a start time, the second column is an end time and the third column is an id. I want to process these data to count the number of distinct ids present at each timestamp, like:
1 2
2 2
3 2
4 2
5 2
6 1
where the first column is a timestamp and the second column is the number of ids.
I have written the following code to process it:
import numpy

j = []  # list of id-counts, one per timestamp in dataset1
for i in range( len( dataset1 ) ):
    # select rows of dataset whose [start, end] interval contains the timestamp
    indices = numpy.argwhere( ( dataset1[i,0] >= dataset[:,0] )
                            & ( dataset1[i,0] <= dataset[:,1] )
                              )
    # count the distinct ids among the matching rows
    j.append( len( set( dataset[indices[:,0],2] ) ) )
where:
- dataset1 has the timestamps 1, 2, 3, 4, 5, 6 in its first column, and
- dataset has three columns: start time, end time and id.
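For concreteness, a minimal in-memory setup matching the toy example above could look like this ( an assumption for illustration only; the real data are loaded from the disk file ):

import numpy

# ids are strings, so an object dtype keeps the mixed columns intact
dataset  = numpy.array( [ [ 1, 3, 'a' ],
                          [ 1, 4, 'b' ],
                          [ 1, 5, 'a' ],
                          [ 2, 6, 'b' ] ], dtype = object )
dataset1 = numpy.array( [ [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ], [ 6 ] ] )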
I have to process about 900 million rows as given in dataset1. This is very slow.
I tried to parallelize it as:
from joblib import Parallel, delayed

inputs = range( len( dataset1 ) )

def processInput( b ):
    # select rows of dataset whose [start, end] interval contains timestamp b
    indices = numpy.argwhere( ( b >= dataset[:,0] )
                            & ( b <= dataset[:,1] )
                              )
    return len( set( dataset[indices[:,0],2] ) )

num_cores = 10
results = Parallel( n_jobs = num_cores )( delayed( processInput )( dataset1[j,0] )  # dataset1, not dataset
                                          for j in inputs )
But this is still slow. I have 10 more cores available, but then the disk becomes the bottleneck.
Is there any way to process this data fast?
Q : Is there any way to process this data fast ?
Yes, there is.
( Python threads do not help here a bit, due to GIL-locking, which re-serialises all efforts into a pure- [SERIAL] sequential processing, with add-on overheads for an ever more intense hunt for the GIL-lock acquisition. Python process-based parallelism is expensive and replicates all RAM-data, including the interpreter, as many times as one asks; the disk then blocks because RAM swaps, not because of smoothly reading down the lane through some < 1E9 data-rows in a file, unless you have a multi- [TB] RAM-device to hold all the python-process/data-copies at once. )
Prepare the data files to best suit your further trivial counting of id-s:

sort -k1,1n \
     -k3,3 \
     --parallel=19 \
     --output=dataset1_SORTED_DATA.txt < dataset1_data_file.txt

sort -n \
     --parallel=19 \
     --output=dataset_SORTED_GATEs.txt < dataset_T1_T2_GATEs_data_file.txt

( The numeric-sort flags -k1,1n and -n assume the time columns hold plain integers; the default lexicographic sort would place 10 before 2. )
The first file, dataset1_SORTED_DATA.txt, is next simply processed ( counting continuous, sequential blocks of rows matching the conditions ), reading it just once, sequentially, as dictated by the also sorted < T1_start, T2_end >-gates, prepared in the second file, dataset_SORTED_GATEs.txt.

This, almost a stream-processing, is smooth and uses just a plain counting of rows that meet both conditions from the ...SORTED_GATEs.txt data file. There the < T1, T2 >-gates again grow monotonically larger and larger, so the first ...SORTED_DATA.txt file gets processed in one smooth pass-through, counting just the id-s, as was requested.
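For illustration only, a minimal Python sketch of such a single sequential pass could look like the following ( the name count_active_ids and the in-memory tuples are assumptions of the sketch; for the question's single-timestamp queries each < T1, T2 >-gate degenerates to one timestamp t, and both inputs must already be sorted, as prepared above ):

import heapq
from collections import defaultdict

def count_active_ids( rows, gates ):
    # rows:  ( start, end, id ) tuples, sorted by start time
    # gates: query timestamps, sorted ascending
    open_count = defaultdict( int )  # id -> number of currently open intervals
    distinct   = 0                   # ids having at least one open interval
    heap       = []                  # ( end, id ) of intervals opened so far
    i = 0
    for t in gates:
        # open every interval whose start time has been reached
        while i < len( rows ) and rows[i][0] <= t:
            _, end, id_ = rows[i]
            heapq.heappush( heap, ( end, id_ ) )
            if open_count[id_] == 0:
                distinct += 1
            open_count[id_] += 1
            i += 1
        # close every interval that ended before t
        while heap and heap[0][0] < t:
            _, id_ = heapq.heappop( heap )
            open_count[id_] -= 1
            if open_count[id_] == 0:
                distinct -= 1
        yield t, distinct

rows = [ ( 1, 3, 'a' ), ( 1, 4, 'b' ), ( 1, 5, 'a' ), ( 2, 6, 'b' ) ]
print( list( count_active_ids( rows, [ 1, 2, 3, 4, 5, 6 ] ) ) )
# [(1, 2), (2, 2), (3, 2), (4, 2), (5, 2), (6, 1)]

Each data row is pushed and popped at most once, so the whole pass stays O( ( N + M ) log N ) and touches the sorted data just once, which is what makes the single smooth pass-through possible.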