
Python: slow processing of millions of records

I want to process data ( contained in a disk file and later loaded into a numpy.array instance ) with rows like:

1 3 a
1 4 b
1 5 a
2 6 b

where the first column is a start time, the second column is an end time and the third column is an id. I want to process these data to find, for each time, the number of distinct ids whose [start, end] interval contains that time, like:

1  2
2  2
3  2
4  2
5  2
6  1

where the first column is a time and the second column is the number of distinct ids active at that time ( e.g. at time 6 only the row 2 6 b is still active, so the count is 1 ).

I have written the following code to process it as:

import numpy

j = []                                        # per-timestamp counts of distinct ids
for i in range( len( dataset1 ) ):
    # rows of dataset whose [start, end] interval contains the i-th timestamp
    indices = numpy.argwhere( ( dataset1[i,0] >= dataset[:,0] )
                            & ( dataset1[i,0] <= dataset[:,1] )
                              )
    # count the distinct ids among those rows
    j.append( len( set( dataset[indices[:,0],2] ) ) )

where:
- dataset1 holds the timestamps 1, 2, 3, 4, 5, 6 in its first column, and
- dataset has three columns: start time, end time and id.

I have to process about 900 million rows, as given in dataset1. This is very slow.

I tried to parallelize it as:

from joblib import Parallel, delayed

inputs = range( len( dataset1 ) )

def processInput( b ):
    # rows of dataset whose [start, end] interval contains the timestamp b
    indices = numpy.argwhere( ( b >= dataset[:,0] )
                            & ( b <= dataset[:,1] )
                              )
    # number of distinct ids among those rows
    return( len( set( dataset[indices[:,0],2] ) ) )

num_cores = 10

results = Parallel( n_jobs = num_cores )( delayed( processInput )( dataset1[j,0] ) for j in inputs )

But this is still slow. I have 10 more cores available, but then the disk becomes the bottleneck.

Is there any way to process this data fast?

Q : Is there any way to process this data fast ?

Yes, there is.

( Python threads do not help here a bit, due to GIL-locking, which re-serialises all efforts into pure- [SERIAL] sequential processing and adds overheads for the ever more intensive hunt for the GIL-lock acquisition. Python process-based parallelism is expensive and replicates all RAM-data, including the interpreter, as many times as one asks for; the disk then blocks because it starts swapping RAM, not because of smoothly reading through some < 1E9 data-rows in a file, unless you have a multi- [TB] RAM-device to hold all the python-process/data-copies at once. )


Step 1:
set up an efficient flow of DATA into just-enough-efficient processing

Prepare the data files so as to best suit the further, trivial counting of the id -s:

sort -n             \
     --parallel=19  \
     --output=dataset1_SORTED_DATA.txt < dataset1_data_file.txt

sort -k1,1n         \
     -k3            \
     --parallel=19  \
     --output=dataset_SORTED_GATEs.txt < dataset_T1_T2_GATEs_data_file.txt
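If the data are already loaded into the numpy arrays dataset and dataset1 from the question, an in-memory equivalent of the above preparation might look like the following sketch ( assuming the time columns can be cast to integers; the *_sorted names are illustrative only ):

import numpy

# the < T1_start, T2_end, id > gates, ordered by start time ( the question's dataset )
dataset_sorted  = dataset[ numpy.argsort( dataset[:,0].astype( numpy.int64 ) ) ]

# the query timestamps, ordered ascending ( the question's dataset1 )
dataset1_sorted = dataset1[ numpy.argsort( dataset1[:,0].astype( numpy.int64 ) ) ]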

Step 2:
sequentially process the sorted file, as per the "gating" read from the 2nd file

Next, simply process the first file, dataset1_SORTED_DATA.txt, reading it just once, sequentially: for each timestamp, count the ids of the rows whose conditions it matches, as dictated by the likewise sorted < T1_start, T2_end > -gates prepared in the second file, dataset_SORTED_GATEs.txt.

This is almost stream-processing: it is smooth and uses just a plain counting of the rows that meet both conditions from the ...SORTED_GATEs.txt data file. Because the timestamps and the < T1, T2 > -gates both grow monotonically larger and larger, the ...SORTED_DATA.txt file gets processed in one smooth pass-through, opening a gate once its T1 has been reached, closing it once its T2 has been passed, and counting just the id -s of the currently open gates, as was requested.
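A minimal sketch of such a single, gated pass might look as follows ( assuming the two files were prepared as in Step 1; the function name count_active_ids and the whitespace-separated file formats are illustrative assumptions, not part of the original post ):

import heapq
from collections import Counter

def count_active_ids( timestamps_path, gates_path, output_path ):
    # timestamps_path : one sorted timestamp per line        ( dataset1_SORTED_DATA.txt )
    # gates_path      : sorted "T1_start T2_end id" rows     ( dataset_SORTED_GATEs.txt )
    with open( timestamps_path ) as ts_f, \
         open( gates_path ) as gt_f, \
         open( output_path, "w" ) as out_f:

        active    = Counter()          # id -> number of currently open gates carrying it
        closing   = []                 # min-heap of ( T2_end, id ) for the open gates
        next_gate = None               # one-row look-ahead buffer for the gates file

        for line in ts_f:
            t = int( line.split()[0] )

            # open every gate whose T1_start <= t
            while True:
                if next_gate is None:
                    row = gt_f.readline()
                    if not row:
                        break
                    t1, t2, gid = row.split()
                    next_gate = ( int( t1 ), int( t2 ), gid )
                if next_gate[0] > t:
                    break
                heapq.heappush( closing, ( next_gate[1], next_gate[2] ) )
                active[ next_gate[2] ] += 1
                next_gate = None

            # close every gate whose T2_end < t
            while closing and closing[0][0] < t:
                _t2, gid = heapq.heappop( closing )
                active[ gid ] -= 1
                if active[ gid ] == 0:
                    del active[ gid ]

            # t and the number of distinct ids active at time t
            out_f.write( "%d %d\n" % ( t, len( active ) ) )

With both inputs pre-sorted, each file is read exactly once and only the currently open gates are held in memory, so the ~9E8 timestamps stream through without any per-timestamp scan of the whole dataset.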
