
Text search elements in a big python list

With a list that looks something like:

cell_lines = ["LN18_CENTRAL_NERVOUS_SYSTEM","769P_KIDNEY","786O_KIDNEY"]

Despite my dabbling in regular expressions, I can't figure out a good way to search the individual strings in a list other than looping through each element and performing the search.

How can I efficiently retrieve the indices of the elements containing "KIDNEY"? (My list is thousands of elements long.)

Use a list comprehension:

[line for line in cell_lines if "KIDNEY" in line]

This is O(n), since we check every item in the list for "KIDNEY".
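If you need the indices rather than the matching strings themselves (as the question asks), the same comprehension works with enumerate. A minimal sketch, using the cell_lines list from the question:

kidney_indices = [i for i, line in enumerate(cell_lines) if "KIDNEY" in line]
# -> [1, 2] for the example list ("769P_KIDNEY" and "786O_KIDNEY")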

If you need to make queries like this often, you should probably think about reorganizing your data into a dictionary grouped by categories such as KIDNEY:

{
    "KIDNEY": ["769P_KIDNEY","786O_KIDNEY"],
    "NERVOUS_SYSTEM": ["LN18_CENTRAL_NERVOUS_SYSTEM"]
}

In this case, every by-category lookup takes constant time on average.
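A sketch of building such a dictionary with collections.defaultdict; the grouping rule here (everything after the first underscore is the category) is only an assumption for illustration, you would substitute whatever rule fits your naming scheme:

from collections import defaultdict

by_category = defaultdict(list)
for line in cell_lines:
    # assumption: the category is everything after the first underscore,
    # e.g. "769P_KIDNEY" -> "KIDNEY"
    category = line.split("_", 1)[1]
    by_category[category].append(line)

by_category["KIDNEY"]  # -> ["769P_KIDNEY", "786O_KIDNEY"]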

You can use a set instead of a list, since it performs membership lookups in constant time on average.
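Note that a set only helps for exact-match lookups, not substring searches. A minimal sketch:

cell_set = set(cell_lines)
"769P_KIDNEY" in cell_set   # True, average O(1)
"KIDNEY" in cell_set        # False -- substring matches are not found this way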

from bisect import bisect_left

def bi_contains(lst, item):
    """Efficient `item in lst` for sorted lists."""
    # If the list is empty or item is larger than the last element, it's not in
    # the list (bisect_left would return len(lst) as the insertion index), so
    # check that first. Otherwise, if the item is in the list it has to be at
    # index bisect_left(lst, item).
    return bool(lst) and item <= lst[-1] and lst[bisect_left(lst, item)] == item


If you keep your list sorted, slightly adapting the above code gives you pretty good efficiency, since binary search is O(log n).
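A usage sketch with the question's data; like a set, this finds whole strings by exact match, not substrings:

sorted_lines = sorted(cell_lines)
bi_contains(sorted_lines, "769P_KIDNEY")   # True
bi_contains(sorted_lines, "KIDNEY")        # False -- not an exact element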

Here's a list of the data structures available in Python along with the time complexities.
https://wiki.python.org/moin/TimeComplexity
