
Python: search for a set of words

In simple terms, I'm looking for the quickest way to search a string for any of a set of words using regular expressions, without writing a for loop. That is, is there a way to do something like this:

text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'
genes = set(['TP53','ATM','BRCA2'])
mutations = 0
if re.search( genes, text):
    mutations += 1
print mutations 
>>>1

The reason is that I'm searching a complicated data structure and don't want to nest too many loops. Here is the problem code in more detail:

import re

genes = set(['TP53','ATM','BRCA2'])
single_gene = 'ATM'
mutations = 0
data_dict = {
             'sample1': set(['AAA','BBB','TP53']),
             'sample2': set(['AAA','ATM','TP53']),
             'sample3': set(['AAA','CCC','XXX']),
             'sample4': set(['AAA','ZZZ','BRCA2']),
            }

for sample in data_dict:
    for gene in data_dict[sample]:
        if re.search(single_gene, gene):
            mutations += 1
            break  # count each sample at most once

I can easily search for 'single_gene', but I want to search for 'genes'. If I add another for loop to iterate through 'genes', the code becomes more complicated, because I have to add another 'break' and a boolean to control when the break occurs. Functionally it works, but it is very clunky, and there must be a more elegant way to do it. Here is my clunky extra loop for the set (currently my only solution):

for sample in data_dict:
    for gene in data_dict[sample]:
        MUT = False
        for mut in genes:
            if re.search(mut, gene):
                mutations += 1
                MUT = True
                break
        if MUT:
            break

IMPORTANTLY: I only want to add 0 or 1 to 'mutations' per sample, depending on whether ANY gene from 'genes' occurs in that sample's set. For example, 'sample2' adds 1 to mutations and 'sample3' adds 0. Let me know if anything needs further clarifying. Thanks in advance!

If your target strings are fixed text (that is, not regular expressions), don't use re. It is far more efficient to test for substrings with the in operator:

for gene in genes:
    if gene in text:
        print('True')

There are variations on that theme, such as:

if [gene for gene in genes if gene in text]:
    ...

This is harder to read: it builds a list with a comprehension and relies on the fact that an empty list [] is considered false in Python.
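A more readable spelling of the same test is the built-in any() with a generator expression, which stops at the first match instead of building a whole list. A minimal sketch, reusing the text and genes values from the question (the variable name found is introduced here for illustration):

```python
text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'
genes = {'TP53', 'ATM', 'BRCA2'}

# any() returns True as soon as one gene is found as a substring of text.
found = any(gene in text for gene in genes)
print(found)  # True
```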

Updated because the question changed:

You are still doing it the hard way. Consider instead:

def find_any_gene(genes, text):
    """Returns True if any of the subsequences in genes
       is found within text.
    """
    for gene in genes:
        if gene in text:
            return True
    return False

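mutations = 0

for sample in data_dict:
    for gene in data_dict[sample]:
        # Each entry of the sample is the text searched for the genes of interest.
        if find_any_gene(genes, gene):
            mutations += 1
            break  # count each sample at most once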

This needs less code to short-circuit the search, is more readable, and leaves find_any_gene() available for other code to call.
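The per-sample count can also be written without any explicit break. The sketch below is one possible way to do it, not the only one: count_mutated_samples is a hypothetical helper name, and data_dict is the question's data written as a valid dict literal.

```python
def count_mutated_samples(genes, data_dict):
    """Count the samples in which at least one entry contains a gene of interest."""
    return sum(
        # True counts as 1 and False as 0, so each sample adds at most 1.
        any(gene in entry for entry in entries for gene in genes)
        for entries in data_dict.values()
    )

genes = {'TP53', 'ATM', 'BRCA2'}
data_dict = {
    'sample1': {'AAA', 'BBB', 'TP53'},
    'sample2': {'AAA', 'ATM', 'TP53'},
    'sample3': {'AAA', 'CCC', 'XXX'},
    'sample4': {'AAA', 'ZZZ', 'BRCA2'},
}

mutations = count_mutated_samples(genes, data_dict)
print(mutations)  # 3 (sample1, sample2 and sample4 each count once)
```

If the entries are always exact gene names rather than longer strings that merely contain them, the inner any() could be replaced with not genes.isdisjoint(entries), which avoids substring tests altogether.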

Does this work? I used some examples from the comments.

Let me know if I am close.

genes = set(['TP53','ATM','BRCA2', 'aaC', 'CDH'])
mutations = 0
data_dict = {
             "sample1":set(['AAA','BBB','TP53']),
             "sample2":set(['AAA','ATM','TP53']),
             "sample3":set(['AAA','CCC','XXX']),
             "sample4":set(['123CDH47aaCDHzz','ZZZ','BRCA2'])
            }

for sample in data_dict:
    for gene in data_dict[sample]:
        if [ mut for mut in genes if mut in gene ]:
            print("Found mutation: " + str(gene) + " in sample: " + str(data_dict[sample]))
            mutations += 1
            break  # count each sample at most once

print(mutations)
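If a single regular-expression search is still preferred, as the question originally asked, the gene names can be joined into one alternation pattern. This is only a sketch, assuming the names are literal strings (hence re.escape); gene_pattern is a name introduced here for illustration.

```python
import re

genes = {'TP53', 'ATM', 'BRCA2'}
text = 'asdfadfgargqerno_TP53_dfgnafoqwefe_ATM_cvafukyhfjakhdfialb'

# Builds the pattern 'ATM|BRCA2|TP53'; sorted() just makes the order deterministic.
gene_pattern = re.compile('|'.join(re.escape(gene) for gene in sorted(genes)))

mutations = 0
if gene_pattern.search(text):  # one search covers every gene in the set
    mutations += 1
print(mutations)  # 1
```

For plain fixed strings the in operator shown earlier is usually simpler; the compiled pattern is mainly useful when the same set of names is searched for many times.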


 