
How to find an item with a specific start string in a set

I have a set of ~10 million items which look something like this:

1234word:something
4321soup:ohnoes
9cake123:itsokay
[...]

Now I need to quickly check whether an item with a specific start is in the set. For example:

x = "4321soup"
is x+* in a_set:
     print ("something that looks like " +x +"* is in the set!")

How do I accomplish this? I've considered using a regex, but I have no clue whether it is even possible in this scenario.

^4321soup.*$

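Yes, it is possible. Try a match against this pattern: if the result is a match, the item is there; if it is None, it is not.

Do not forget to set the m (multiline) and g (global) flags when matching against the whole text at once.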

See demo.

http://regex101.com/r/lS5tT3/28
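
In Python, the same check could be sketched roughly as follows, assuming the items live in an iterable called a_set as in the question. The helper name has_prefix_regex is hypothetical. re.escape is used so that regex metacharacters in the prefix are treated literally. Because re.match only matches at the beginning of a string, neither ^ nor the m flag is needed when each item is tested on its own, and Python's re module has no g flag.

```python
import re

def has_prefix_regex(items, prefix):
    """Return True if any item starts with the literal prefix."""
    # re.escape makes characters such as '.' or '*' in the prefix literal.
    pattern = re.compile(re.escape(prefix))
    # pattern.match only succeeds at the beginning of each string.
    return any(pattern.match(item) for item in items)

a_set = {"1234word:something", "4321soup:ohnoes", "9cake123:itsokay"}
print(has_prefix_regex(a_set, "4321soup"))  # True
print(has_prefix_regex(a_set, "soup"))      # False
```

This still visits the items one by one, so it is a linear scan in the worst case.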

If you only want to match the start of the string, use str.startswith instead of a regex, especially considering that you have ~10 million items.

#!/usr/bin/python

# Avoid naming the variable "str", which would shadow the built-in type.
item = "1234word:something"
print(item.startswith("1234"))  # True

In Python, assuming your contents are in a file named "mycontentfile":

>>> with open("mycontentfile", "r") as myfile:
...     data = myfile.read()
...
>>> for item in data.split("\n"):
...     if item.startswith("4321soup"):
...         print(item.strip())
...
4321soup:ohnoes
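
For a file with around 10 million lines, it is not necessary to read everything into memory first. A minimal sketch, assuming the same kind of file and a hypothetical helper named find_prefix_in_file, iterates over the file line by line and stops at the first hit:

```python
def find_prefix_in_file(path, prefix):
    """Return the first line starting with prefix, or None if there is none."""
    with open(path, "r") as handle:
        # Iterating over the file object reads one line at a time.
        for line in handle:
            if line.startswith(prefix):
                # Drop the trailing newline before returning the line.
                return line.rstrip("\n")
    return None
```

Calling find_prefix_in_file("mycontentfile", "4321soup") on the example data would be expected to return the matching line, and None when no line starts with the prefix.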

Hash sets are very good for checking the existence of a complete element. In your task you need to check for the existence of a starting part, not a complete element. That is why it is better to use a tree or a sorted sequence instead of a hash mechanism (the internal implementation of Python's set).

However, according to your examples, it looks like you want to check the whole part before the ':'. For that purpose you can build a set of these first parts, which makes the existence check a plain set lookup:

items = set(x.split(':')[0] for x in a_set) # a_set can be any iterable

def is_in_the_set(x):
    return x in items

is_in_the_set("4321soup")  # True
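
The sorted-sequence alternative mentioned in this answer, for arbitrary prefixes rather than only the whole part before ':', can be sketched with the standard bisect module. This is one possible illustration, not necessarily what the answer's author had in mind; build_index and has_prefix are hypothetical names. The list is sorted once, after which each lookup is a binary search.

```python
import bisect

def build_index(items):
    """Sort the items once so prefix lookups can use binary search."""
    return sorted(items)

def has_prefix(sorted_items, prefix):
    """Return True if some item in sorted_items starts with prefix."""
    # bisect_left finds the first position whose item is >= prefix.
    # If any item starts with prefix, the smallest such item sits
    # exactly at that position, so one startswith check is enough.
    i = bisect.bisect_left(sorted_items, prefix)
    return i < len(sorted_items) and sorted_items[i].startswith(prefix)

sorted_items = build_index(
    {"1234word:something", "4321soup:ohnoes", "9cake123:itsokay"}
)
print(has_prefix(sorted_items, "4321soup"))  # True
print(has_prefix(sorted_items, "4321so"))    # True
print(has_prefix(sorted_items, "4321soap"))  # False
```

Sorting costs O(n log n) up front, and each lookup afterwards takes O(log n) string comparisons instead of a scan over all items.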

In this case, what matters is how to iterate over the set efficiently.
Since you have to check each item until you find a match, the best way is to create a generator expression and consume it only until it yields a result. To accomplish this, use next():

a_set = {'1234word:something', '4321soup:ohnoes', '9cake123:itsokay'}  # a huge set
prefix = '4321soup'  # prefix you want to search for
# Pass next() a generator with the match condition; it stops at the first match
# and returns False if the generator is exhausted without finding one.
# The generator needs its own parentheses because next() gets a second argument.
next((x for x in a_set if x.startswith(prefix)), False)

I'm currently thinking that the most reasonable solution would be something like a sorted tree of dicts (key = x and value = y), with the tree sorted by the dicts' keys. No clue how to do that, though. – Daedalus Mythos

No need for a tree of dicts; a single dictionary would do. If you have the key:value pairs stored in a dictionary, say itemdict, you can write:

x = "4321soup"
if x in itemdict:
    print ("something that looks like "+x+"* is in the set!")
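
One way such a dictionary might be built from the original items is sketched below. It assumes that every item contains a ':' and that the part before the first ':' is unique; if keys repeat, later items overwrite earlier ones. As with the set of first parts, this only answers lookups for the whole key, not for an arbitrary prefix.

```python
a_set = {"1234word:something", "4321soup:ohnoes", "9cake123:itsokay"}

# Split each item once at the first ':' into a (key, value) pair.
# Assumption: every item contains ':'; otherwise dict() raises ValueError.
itemdict = dict(item.split(":", 1) for item in a_set)

print("4321soup" in itemdict)  # True
print(itemdict["4321soup"])    # ohnoes
```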
