
Searching for a string in a list in Python

Hello, I have a string that contains a mail address, for example user@foo.bar.com , and a list that contains only domains ('bar.com', 'stackoverflow.com', etc.).

I want to check whether the list contains my string's domain. Right now I am using code like this:

if tokens[1].partition("@")[2] in domainlist:

tokens[1] contains the mail address and domainlist contains the domains. But as you can see, tokens[1].partition("@")[2] returns foo.bar.com , while my list has the domain bar.com . How can I make this if statement evaluate to true? It also needs to be very fast, because hundreds of mail addresses will come in every second.

It should work like this:

if any(tokens[1].endswith(domain) for domain in domainlist): 

If speed is really an issue for you, you can look into methods like Aho-Corasick. There are plenty of implementations available, like esmre / esm http://code.google.com/p/esmre/

As pointed out by @Riccardo Galli, simple string matching will produce some false positives, so you can try esmre first, adding the corresponding regexes to the index, something like index.enter("(^|\\.){0}$".format(domain))
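The anchored pattern above can also be tried with the standard `re` module instead of esmre. This is a swapped-in technique for illustration, not what the answer used; `DOMAINS`, `build_domain_matcher`, and `domain_matches` are hypothetical names. A minimal sketch, assuming the domain list is small enough for one compiled alternation:

```python
import re

# Hypothetical example data, not taken from the original post.
DOMAINS = ['bar.com', 'stackoverflow.com']

def build_domain_matcher(domains):
    """Compile one regex matching a host that equals, or is a subdomain of, a listed domain."""
    # re.escape keeps the dots in each domain literal.
    alternatives = '|'.join(re.escape(d) for d in domains)
    # (?:^|\.) forces the match to begin at the start of the host or right after a dot.
    return re.compile(r'(?:^|\.)(?:' + alternatives + r')$')

matcher = build_domain_matcher(DOMAINS)

def domain_matches(address):
    # Everything after the last '@' is treated as the host.
    host = address.rpartition('@')[2]
    return matcher.search(host) is not None
```

With this data, `domain_matches('user@foo.bar.com')` is true, while `domain_matches('user@foobar.com')` is false because `bar.com` is not preceded by a dot there.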

First, make domainlist a set. Membership tests on a set are faster than on a list.

Second, add all 'superdomains' into this set, such as 'bar.com' for 'foo.bar.com'.

domainlist = ['foo.bar.com', 'bar2.com', 'foo3.bar3.foobar.com']
domainset = set()
for domain in domainlist:
    parts = domain.split('.')
    domainset.update('.'.join(parts[i:]) for i in range(len(parts)-1))

#domainset is now:
set(['bar.com',
     'bar2.com',
     'bar3.foobar.com',
     'foo.bar.com',
     'foo3.bar3.foobar.com',
     'foobar.com'])

And now you can test

if tokens[1].partition("@")[2] in domainset:

Hundreds of mail addresses should not be an issue. The following is a one-liner:

any(domain.endswith(d) for d in MY_DOMAINS)

Here, domain comes from user,sep,domain = address.rpartition('@') . Splitting at the first @ instead, as your current method does, fails for email addresses such as "B@tm4n"@something.com , which are valid according to http://tools.ietf.org/html/rfc5322
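To see the difference, here is a small sketch that splits the quoted address from the paragraph above both ways. The variable names are only illustrative.

```python
address = '"B@tm4n"@something.com'

# partition splits at the FIRST '@', which here sits inside the quoted local part.
first_split = address.partition('@')[2]

# rpartition splits at the LAST '@', which separates the local part from the domain.
user, sep, domain = address.rpartition('@')

print(first_split)  # tm4n"@something.com
print(domain)       # something.com
```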

If performance becomes a factor, you can use a trie (a tree-shaped data structure for prefix lookups). If performance is still a factor after that, you can use other tricks.
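One possible shape for such a trie, as a rough sketch rather than a tuned implementation: store each listed domain label by label in reverse order ('com', then 'bar'), so a host matches as soon as walking its labels from the right reaches the end of a listed domain. The names `END`, `build_trie`, and `in_trie` are hypothetical, and lowercasing the input is an assumption.

```python
END = object()  # sentinel key marking "a listed domain ends at this node"

def build_trie(domains):
    """Build a nested-dict trie keyed on domain labels, rightmost label first."""
    root = {}
    for d in domains:
        node = root
        for label in reversed(d.lower().split('.')):
            node = node.setdefault(label, {})
        node[END] = True
    return root

def in_trie(trie, host):
    """Return True if host equals a listed domain or is a subdomain of one."""
    node = trie
    for label in reversed(host.lower().split('.')):
        node = node.get(label)
        if node is None:
            return False   # no listed domain continues with this label
        if END in node:
            return True    # a listed domain is a label-aligned suffix of host
    return False
```

The lookup cost depends on the number of labels in the host, not on how many domains are listed.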

The above goes through each element in the domains you're checking, so if you have 1000 domains in your list, you need to do 1000 comparisons for each email address. If this is an issue, you can do the following to get an O(1) set lookup per suffix (you also probably want to make sure you're not checking more than 5 suffixes, to protect yourself from maliciously crafted email addresses):

MY_DOMAINS = set(MY_DOMAINS)

def suffixes(domain):
    """
    suffixes('foo.bar.com') -yields-> 'foo.bar.com', 'bar.com', 'com'
    """
    while True:
        yield domain
        parts = domain.split('.', 1)
        if len(parts) > 1:
            # drop the leftmost label and continue with the rest
            domain = parts[1]
        else:
            break

def isInList(address):
    user, sep, domain = address.rpartition('@')
    return any(suffix in MY_DOMAINS for suffix in suffixes(domain))
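The cap of 5 suffixes mentioned above could look roughly like this. It is a sketch under assumptions: `MAX_SUFFIXES`, `capped_suffixes`, and `is_in_domains` are hypothetical names, the host is lowercased, and suffixes are generated from the right so that only the last few labels are ever examined.

```python
MAX_SUFFIXES = 5  # assumed cap, following the "no more than 5 suffixes" suggestion

def capped_suffixes(domain, limit=MAX_SUFFIXES):
    """Yield at most `limit` label-aligned suffixes of `domain`, shortest first."""
    labels = domain.split('.')
    for i in range(1, min(len(labels), limit) + 1):
        yield '.'.join(labels[-i:])

def is_in_domains(address, domains):
    """Check the host of `address` against a set of domains, one set lookup per suffix."""
    host = address.rpartition('@')[2].lower()
    return any(suffix in domains for suffix in capped_suffixes(host))
```

For example, `is_in_domains('user@foo.bar.com', {'bar.com'})` is true. The trade-off is that a listed domain with more than `MAX_SUFFIXES` labels can never match.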

Unlike the other answers, here 'foo.com' would not also match '@y.afoo.com':

def mailInDomains(mail, domains):
    for domain in domains:
        dLen = len(domain)
        # the character just before the domain must be '.' or '@'
        if len(mail) > dLen and mail[-dLen:] == domain and mail[-dLen-1] in ('.', '@'):
            return True

    return False
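The same boundary check can be written more compactly, since str.endswith accepts a tuple of suffixes. This is an alternative sketch, not the answer's code, and `mail_in_domains` is a hypothetical name:

```python
def mail_in_domains(mail, domains):
    """Return True if mail ends with '@domain' or '.domain' for any listed domain."""
    # Prefixing each domain with '@' and '.' keeps 'foo.com' from matching 'afoo.com'.
    suffixes = tuple(sep + d for d in domains for sep in ('@', '.'))
    return mail.endswith(suffixes)
```

Here `mail_in_domains('x@y.afoo.com', ['foo.com'])` is false, while `mail_in_domains('x@y.foo.com', ['foo.com'])` is true.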
