简体   繁体   中英

How to find a specific file in Python

I have a directory with files of the following structure

A2ML1_A8K2U0_MutationOutput.txt
A4GALT_Q9NPC4_MutationOutput.txt
A4GNT_Q9UNA3_MutationOutput.txt
...

The first few letters represent the gene, the next few the Uniprot Number (a unique protein identifier) and MutationOutput is self explanatory.

In Python, I want to execute the following line:

f_outputfile.write(mutation_directory + SOMETHING +line[1+i]+"_MutationOutput.txt\n")

here, line[1+i] correctly identifies the Uniprot ID.

What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in it's uniprot field and then pull out the gene name.

I know I can list all the files in the directory, then I can do str.split() on each string and find it. But is there a way I can do that smarter? Should I use a dictionary? Can I just do a quick regex search?

The entire directory is about 8,116 files -- so not that many.

Thank you for your help!

What I need to do is correctly identify the gene name. So somehow, I need to quickly search over that directory, find the file that has the line[i+1] value in it's uniprot field and then pull out the gene name.

Think about how you'd do this in the shell:

$ ls mutation_directory/*_A8K2U0_MutationOutput.txt
mutation_directory/A2ML1_A8K2U0_MutationOutput.txt

Or, if you're on Windows:

D:\Somewhere> dir mutation_directory\*_A8K2U0_MutationOutput.txt
A2ML1_A8K2U0_MutationOutput.txt

And you can do the exact same thing in Python, with the glob module:

>>> import glob
>>> glob.glob('mutation_directory/*_A8K2U0_MutationOutput.txt')
['mutation_directory/A2ML1_A8K2U0_MutationOutput.txt']

And of course you can wrap this up in a function:

>>> def find_gene(uniprot):
...     pattern = 'mutation_directory/*_{}_MutationOutput.txt'.format(uniprot)
...     return glob.glob(pattern)[0]

But is there a way I can do that smarter? Should I use a dictionary?

Whether that's "smarter" depends on your use pattern.

If you're looking up thousands of files per run, it would certainly be more efficient to read the directory just once and use a dictionary instead of repeatedly searching. But if you're planning on, eg, reading in an entire file anyway, that's going to take orders of magnitude longer than looking it up, so it probably won't matter. And you know what they say about premature optimization.

But if you want to, you can make a dictionary keyed by the Uniprot number pretty easily:

d = {}
for f in os.listdir('mutation_directory'):
    gene, uniprot, suffix = f.split('_')
    d[uniprot] = f

And then:

>>> d['A8K2U0']
'mutation_directory/A2ML1_A8K2U0_MutationOutput.txt'

Can I just do a quick regex search?

For your simple case, you don't need regular expressions.*

More importantly, what are you going to search? Either you're going to loop—in which case you might as well use glob —or you're going to have to build up an artificial giant string to search—in which case you're better off just building the dictionary.


* In fact, at least on some platforms/implementations, glob is implemented by making a regular expression out of your simple wildcard pattern, but you don't have to worry about that.

You can use glob

In [4]: import glob

In [5]: files = glob.glob('*_Q9UNA3_*')

In [6]: files
Out[6]: ['A4GNT_Q9UNA3_MutationOutput.txt']

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM