How to optimize a search algorithm for a large 2D array?

After I added this line of code:

pathResult.append(find_max_path(arr[a + 1:, b + 1:], path, 1))

the function began to run slowly, but without this line it does not work correctly. How can I optimize the code? The function looks for the path with the maximum number of points in a two-dimensional array in which the values equal to 100 lie predominantly on the main diagonal. A row can contain several values equal to 100, but any column contains at most one. Full code:

import numpy as np

arr = np.array([
    [000,000,000,000,000,000,000],
    [000,000,000,000,000,100,000],
    [000,000,000,000,000,000,000],
    [000,000,100,000,000,000,000],
    [000,000,000,100,000,000,000],
    [000,000,000,000,100,000,000],
    [000,000,000,000,000,000,000],
    [000,100,000,000,000,000,000]])

def find_max_path(arr, path=None, countempty=0):
    if path is None:
        path = []
    a = 0
    b = 0
    while (a < len(arr)) and (b < len(arr[a])):
        if arr[a][b] == 100:
            path.append({"a": 1 + countempty, "b": 1})
            countempty = 0
            a += 1
            b += 1
            continue
        else:
            check = []
            for j in range(b + 1, len(arr[a])):
                if arr[a][j] == 100:
                    check.append({"arr": arr[a + 1:, j + 1:],
                                  "a": 1 + countempty,
                                  "b": j - b + 1})
                    break
            if not check:
                countempty += 1
                a += 1
                continue
            i = a
            while i < len(arr):
                if arr[i][b] == 100:
                    check.append({"arr": arr[i + 1:, b + 1:],
                                  "a": i - a + 1,
                                  "b": 1})
                    break
                i += 1
            pathResult = []
            for c in check:
                pathNew = path[:]
                pathNew.append({"a": c["a"], "b": c["b"]})
                pathResult.append(find_max_path(c["arr"], pathNew))
            maximum = 0
            maxpath = []
            pathResult.append(find_max_path(arr[a + 1:, b + 1:], path, 1))
            for p in pathResult:
                if len(p) > maximum:
                    maximum = len(p)
                    maxpath = p[:]
            if maxpath:
                return maxpath
            else:
                countempty += 1
        a += 1
    return path

print(find_max_path(arr))

UPDATE1: added two break statements in the inner loops (execution time is halved)

Output:

[{'a': 3, 'b': 2}, {'a': 1, 'b': 1}, {'a': 1, 'b': 1}]
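One possible reading of the problem, offered as a sketch and not as a drop-in replacement: if a path is a chain of cells equal to 100 in which both the row and the column strictly increase, and each column holds at most one such cell (as stated above), then the longest chain is a longest increasing subsequence over the row indices taken in column order. The function name `longest_chain` and its `target` parameter are hypothetical, and the result is a list of absolute (row, col) coordinates, not the relative {"a", "b"} steps that `find_max_path` returns.

```python
from bisect import bisect_left

def longest_chain(grid, target=100):
    """Return (row, col) cells equal to target that form a longest chain
    in which both row and column strictly increase."""
    n_rows = len(grid)
    n_cols = len(grid[0]) if n_rows else 0

    # Assumption taken from the question: at most one target per column.
    points = []  # (row, col), collected in increasing column order
    for col in range(n_cols):
        for row in range(n_rows):
            if grid[row][col] == target:
                points.append((row, col))
                break

    tails = []      # tails[k]: smallest end row of any chain of length k + 1
    tails_idx = []  # index into points of the cell that ends that chain
    prev = [-1] * len(points)  # back-pointers for reconstructing the chain
    for i, (row, _) in enumerate(points):
        k = bisect_left(tails, row)  # bisect_left keeps the rows strictly increasing
        if k == len(tails):
            tails.append(row)
            tails_idx.append(i)
        else:
            tails[k] = row
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1

    # Walk back from the last cell of the longest chain.
    chain = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        chain.append(points[i])
        i = prev[i]
    return chain[::-1]

# Same data as the array in the question, written as a plain list of lists.
grid = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 100, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 100, 0, 0, 0, 0],
    [0, 0, 0, 100, 0, 0, 0],
    [0, 0, 0, 0, 100, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 100, 0, 0, 0, 0, 0],
]
print(longest_chain(grid))  # [(3, 2), (4, 3), (5, 4)]
```

The chain has the same three points as the output above. After the initial scan of the matrix, the work is O(n log n) in the number of cells equal to 100, with no recursion and no slicing of sub-arrays.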

UPDATE2

Usage. I use this algorithm to synchronize two streams of information. The rows correspond to words from the book text, for which the position in the text of the book is known (L_word). The columns correspond to words recognized from the audiobook, for which the recognized word itself and the time it was spoken in the audio stream are known (R_word). This gives two arrays of words. To synchronize these two lists, I use something like this:

from rapidfuzz import process, fuzz
import numpy as np

window = 50
# L_word = ... # words from the book text
# R_word = ... # words recognized from the audiobook
L = 0
R = 0
L_chunk = L_word[L:L+window]
R_chunk = R_word[R:R+window]
scores = process.cdist(L_chunk,
                       R_chunk,
                       scorer=fuzz.ratio,
                       dtype=np.uint8,  # cdist takes dtype, not type
                       score_cutoff=100)
p = find_max_path(scores)
# ... path processing ...
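To make the shape of `scores` concrete, here is a minimal pure-Python stand-in. It assumes that `fuzz.ratio` with `score_cutoff=100` only yields 100 for identical strings and 0 otherwise, so plain equality gives the same kind of matrix. The helper `exact_match_scores` and the sample word lists are hypothetical and are not part of rapidfuzz or of the original data.

```python
def exact_match_scores(left_words, right_words):
    """Build a rows-by-columns matrix: 100 where the words are identical, else 0."""
    return [[100 if left == right else 0 for right in right_words]
            for left in left_words]

# Hypothetical sample data: book words (rows) and recognized words (columns).
L_word = ["the", "quick", "brown", "fox", "jumps"]
R_word = ["the", "quik", "brown", "fox", "jumped"]

window = 3
L = 0
R = 0
scores = exact_match_scores(L_word[L:L + window], R_word[R:R + window])
for row in scores:
    print(row)
```

Rows index the book words and columns index the recognized words, so matching words in the same order line up near the main diagonal. `find_max_path` slices its argument with NumPy syntax, so a list of lists like this would need to be wrapped in `np.array(...)` before being passed to it.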

As a result of all this work, we get something like this: a video book with pagination and subtitles synchronized with the audio (download, 3GB).

The Python documentation shows how to do debugging and profiling. Step through the algorithm and time the functions to see where the bottleneck is.
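As a minimal sketch of that advice, the standard-library `cProfile` and `pstats` modules can report where time is spent. The helper `profile_call` and the toy function `slow_sum` are hypothetical names used only for illustration. For the question's code, one would presumably pass `find_max_path` and the array in place of `slow_sum` and its argument.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Hypothetical stand-in for the function under investigation.
    total = 0
    for i in range(n):
        total += i * i
    return total

def profile_call(func, *args, top=5):
    """Run func(*args) under cProfile; return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()

result, report = profile_call(slow_sum, 100_000)
print(report)
```

For a recursive function, the call count column of the report shows how many times the function re-enters itself, which is often the first hint that the same sub-arrays are being explored repeatedly.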
