Python - find index of unique substring contained in list of strings without going through all the items

I have a question that may sound like a duplicate, but I haven't been able to find a good answer for it. Every day I receive a list with a few thousand strings in it. I also know that this list will always contain exactly one item containing the word "other". For example, one day I may have:

a = ['mark','george', .... , " ...other ...", 'matt','lisa', ... ]

Another day I may get:

a = ['karen','chris','lucas', ............................., '...other']

As you can see, the position of the item containing the substring "other" is random. My goal is to get the index of that item as fast as possible. The other answers I found here mostly suggest list comprehensions or for loops, for example "Finding a substring within a list in Python" and "Check if a Python list item contains a string inside another string". They don't work for me because they are too slow. Other solutions suggest using any() to check whether "other" is contained in the list, but I need the index, not a boolean value. I believe a regex might be a good solution, although I'm having a hard time figuring out how. So far I have only managed the following:

# any_other_value_available  will tell me extremely quickly if 'other' is contained in list.
any_other_value_available = 'other' in str(list_unique_keys_in_dict).lower()

From here, I don't quite know what to do. Any suggestions? Thank you.

Methods Explored

1. Generator Method

next(i for i,v in enumerate(test_strings) if 'other' in v)

2. List Comprehension Method

[i for i,v in enumerate(test_strings) if 'other' in v]

3. Using Index with Generator (suggested by @HeapOverflow)

test_strings.index(next(v for v in test_strings if 'other' in v))

4. Regular Expression with Generator

re_pattern = re.compile('.*other.*')
next(test_strings.index(x) for x in test_strings if re_pattern.search(x))
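
As a quick sanity check, the four approaches above can be run side by side on a small list. This is only a sketch: the sample data is hypothetical and stands in for the real list of a few thousand strings.

```python
import re

# Hypothetical sample data; the real list would hold a few thousand strings.
test_strings = ['mark', 'george', '...other...', 'matt', 'lisa']

# 1. Generator: stops at the first match.
idx_generator = next(i for i, v in enumerate(test_strings) if 'other' in v)

# 2. List comprehension: scans the whole list, then takes the first hit.
idx_list_comp = [i for i, v in enumerate(test_strings) if 'other' in v][0]

# 3. Find the matching string first, then ask the list for its position.
idx_index = test_strings.index(next(v for v in test_strings if 'other' in v))

# 4. Regular expression with a generator.
re_pattern = re.compile('.*other.*')
idx_regex = next(test_strings.index(x) for x in test_strings if re_pattern.search(x))

print(idx_generator, idx_list_comp, idx_index, idx_regex)  # 2 2 2 2
```

All four return the same index; they differ only in how much work they do to get there.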

Conclusion

The index method (suggested by @HeapOverflow in the comments) had the fastest time.
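
The question guarantees that exactly one item contains "other", but if that ever fails, next() raises StopIteration. A minimal sketch of the index method wrapped in a helper that returns a fallback value instead; the function name find_substring_index and its default of -1 are assumptions made for illustration, not part of the original answer.

```python
def find_substring_index(strings, needle='other', default=-1):
    """Return the index of the first string containing needle, or default."""
    # The second argument to next() is returned when the generator is empty.
    match = next((s for s in strings if needle in s), None)
    if match is None:
        return default
    # list.index returns the position of the first element equal to match.
    return strings.index(match)


print(find_substring_index(['karen', 'chris', 'lucas', '...other']))  # 3
print(find_substring_index(['karen', 'chris', 'lucas']))              # -1
```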

Test Code

Using perfplot, which uses timeit:

import random 
import string
import re
import perfplot

def random_string(N):
    return ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(N))

def create_strings(length):
    M = length // 2
    random_strings = [random_string(5) for _ in range(length)]

    front = ['...other...'] + random_strings
    middle = random_strings[:M] + ['...other...'] + random_strings[M:]
    end_ = random_strings + ['...other...']

    return front, middle, end_

def search_list_comprehension(test_strings):
    return [i for i,v in enumerate(test_strings) if 'other' in v][0]

def search_generator(test_strings):
    return next(i for i,v in enumerate(test_strings) if 'other' in v)

def search_index(test_strings):
    return test_strings.index(next(v for v in test_strings if 'other' in v))

def search_regex(test_strings):
    re_pattern = re.compile('.*other.*')
    return next(test_strings.index(x) for x in test_strings if re_pattern.search(x))

# Each benchmark is run with '...other...' placed at the front, middle and end of a random list of strings.

out = perfplot.bench(
    setup=lambda n: create_strings(n),  # create front, middle, end lists from n random strings
    kernels=[
        lambda a: [search_list_comprehension(x) for x in a],
        lambda a: [search_generator(x) for x in a],
        lambda a: [search_index(x) for x in a],
        lambda a: [search_regex(x) for x in a],
    ],
    labels=["list_comp", "generator", "index", "regex"],
    n_range=[2 ** k for k in range(15)],
    xlabel="length list",
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

out.show()
print(out)

Results

Performance plot

length list   regex    list_comp  generator    index
     1.0     10199.0     3699.0     4199.0     3899.0
     2.0     11399.0     3899.0     4300.0     4199.0
     4.0     13099.0     4300.0     4599.0     4300.0
     8.0     16300.0     5299.0     5099.0     4800.0
    16.0     22399.0     7199.0     5999.0     5699.0
    32.0     34900.0    10799.0     7799.0     7499.0
    64.0     59300.0    18599.0    11799.0    11200.0
   128.0    108599.0    33899.0    19299.0    18500.0
   256.0    205899.0    64699.0    34699.0    33099.0
   512.0    403000.0   138199.0    69099.0    62499.0
  1024.0    798900.0   285600.0   142599.0   120900.0
  2048.0   1599999.0   582999.0   288699.0   239299.0
  4096.0   3191899.0  1179200.0   583599.0   478899.0
  8192.0   6332699.0  2356400.0  1176399.0   953500.0
 16384.0  12779600.0  4731100.0  2339099.0  1897100.0
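
To make the last row of the table easier to read, the timings can be expressed relative to the index method. The sketch below only does arithmetic on the numbers copied from the 16384-element row; it does not re-run the benchmark.

```python
# Timings copied from the 16384-element row of the table above.
timings = {
    'regex': 12779600.0,
    'list_comp': 4731100.0,
    'generator': 2339099.0,
    'index': 1897100.0,
}

# Express every timing as a multiple of the fastest method (index).
relative = {name: round(t / timings['index'], 2) for name, t in timings.items()}
print(relative)
# {'regex': 6.74, 'list_comp': 2.49, 'generator': 1.23, 'index': 1.0}
```

On this run, the regex version took roughly 6.7 times as long as the index method, the list comprehension about 2.5 times, and the generator about 1.2 times.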

If you are looking for a substring, regular expressions are a good way to find it.

In your case you are looking for all strings that contain the substring 'other'. As you have already mentioned, the elements of the list are in no particular order, so the search for the desired element is linear. It would remain linear even if the list were sorted, because the substring can occur anywhere inside a string.

A regular expression that might describe your search is query='.*other.*'. According to the documentation:

. (Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.

With .* before and after other, there can be 0 or more repetitions of any character on either side of it.

For example

import re

list_of_variables = ['rossum', 'python', '..other..', 'random']
query = '.*other.*'
indices = [list_of_variables.index(x) for x in list_of_variables if re.search(query, x)]

This returns a list of the indices of the items that match your query. In this example indices will be [2], since '..other..' is the third element in the list.
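
A possible variation on the example above, offered as a sketch rather than as the answer's own method: re.search already scans the whole string for a match, so the surrounding .* is not needed, and enumerate gives each position directly instead of looking it up again with index(). The name pattern is introduced here for illustration.

```python
import re

list_of_variables = ['rossum', 'python', '..other..', 'random']

# re.search looks for a match anywhere in the string, so the bare word is enough.
pattern = re.compile('other')

# enumerate yields each position directly, avoiding a second lookup with index().
indices = [i for i, x in enumerate(list_of_variables) if pattern.search(x)]
print(indices)  # [2]
```

Using enumerate also keeps the result correct if the same string appears more than once in the list, since index() always reports the first occurrence.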
