
How to append characters to a string used as a Python dictionary key (when there are multiple entries for that string)?

I am pulling out sequence coordinates from the output file produced by HMMER (finds DNA sequences, matching a query, in a genome assembly file).

I create a python dictionary where the key is the source sequence name (a string), and the value is a list comprising the start and end coordinates of the target sequence. However, HMMER often finds multiple matches on a single source sequence (contig/chromosome).

This means that as I add to the dictionary, if I come across multiple matches on a contig, each is overwritten by the following match.

For example, HMMER finds the following matches:

Name  Start  End
4415  16723  17556
127   1290   1145
1263  34900  37834
4415  2073   3899
4415  4580   6004

But this results in the following dictionary (I want separate entries for each match):

{'127': ['1290', '1145'], '1263': ['34900', '37834'], '4415': ['4580', '6004']}

How can I append a letter to the key so that subsequent matches are unique and do not overwrite the previous ones, i.e. 4415, 4415a, 4415b, and so on?

matches = {}

for each line of HMMER file:
    split the line
    make a list of fields 4 & 5 (the coordinates)
    # at this stage I need a way of checking whether the key (sequenceName)
    # is already in the dictionary (easy), and if it is, appending a letter
    # to sequenceName to make it unique
    matches[sequenceName] = list

Creating different keys for what is really the same sequence name is not a good approach. Instead, use a list as the value and store the coordinates of every match under the same key. collections.defaultdict() is well suited to this:

>>> coords = [['4415', '16723', '17556'], ['127', '1290', '1145'], ['1263', '34900', '37834'], ['4415', '2073', '3899'], ['4415', '4580', '6004']]
>>> from collections import defaultdict
>>> 
>>> d = defaultdict(list)
>>> 
>>> for i, j, k in coords:
...     d[i].append((j, k))
... 
>>> d
defaultdict(<type 'list'>, {'1263': [('34900', '37834')], '4415': [('16723', '17556'), ('2073', '3899'), ('4580', '6004')], '127': [('1290', '1145')]})
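
As a rough sketch of how this could fit into the line-by-line loop from the question: the function name collect_matches and its column parameters are hypothetical, and the input is just the three-column sample from the question, not a real HMMER file. For the real file you would point name_col, start_col and end_col at the fields you actually use. Lines starting with "#" are skipped on the assumption that they are comments.

```python
from collections import defaultdict

# Sample rows copied from the question (name, start, end).
# This is NOT the real HMMER layout; adjust the column indexes for your file.
sample_lines = """\
4415 16723 17556
127 1290 1145
1263 34900 37834
4415 2073 3899
4415 4580 6004
""".splitlines()


def collect_matches(lines, name_col=0, start_col=1, end_col=2):
    """Group (start, end) pairs by sequence name (hypothetical helper)."""
    matches = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment lines
        coords = (int(fields[start_col]), int(fields[end_col]))
        matches[fields[name_col]].append(coords)
    return matches


matches = collect_matches(sample_lines)

# Every match on a contig is kept, in the order it was read.
for name, hits in matches.items():
    for start, end in hits:
        print(name, start, end)
```

Keeping one key per contig means a later lookup such as matches['4415'] returns all three hits at once, with no need to guess which suffixes exist.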

Besides, appending a character to the key is not ideal: to generate the next suffix you have to keep track of how many times each key has already been seen, and you do not know that number in advance.

As an alternative, if a numeric suffix is acceptable, you can count the occurrences of each key with a Counter() object and append the current count to the end of the key:

>>> from collections import Counter
>>> d = {}
>>> c = Counter()
>>> for i, j, k in coords:
...     c.update((i,))
...     d["{}_{}".format(i, c[i])] = (j, k)
... 
>>> d
{'4415_1': ('16723', '17556'), '4415_3': ('4580', '6004'), '4415_2': ('2073', '3899'), '127_1': ('1290', '1145'), '1263_1': ('34900', '37834')}

You can do something like this:

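matches = {'127': ['1290', '1145'], '1263': ['34900', '37834'], '4415': ['4580', '6004']}

# sample key_name and value
key_name = '4415'
value = ['2073', '3899']

if key_name not in matches:
    matches[key_name] = value
else:
    # try the suffixes 'a' to 'z' and use the first one that is free
    for i in range(26):
        candidate = key_name + chr(ord('a') + i)
        if candidate not in matches:
            matches[candidate] = value
            break

This will name repeated keys 4415a, 4415b, and so on, up to 4415z. A key that occurs more than 27 times will not be stored.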
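
The letter suffixes asked for in the question can also be generated without a fixed 26-letter limit. The following is only a sketch: letter_suffix is a hypothetical helper (not part of any library) that continues with aa, ab, ... after z, and a Counter records how often each name has been seen so that no search over existing keys is needed.

```python
from collections import Counter


def letter_suffix(n):
    """Return '' for 0, then 'a'..'z' for 1..26, then 'aa', 'ab', ... (hypothetical helper)."""
    suffix = ""
    while n > 0:
        n, rem = divmod(n - 1, 26)
        suffix = chr(ord("a") + rem) + suffix
    return suffix


# Sample data from the question.
coords = [['4415', '16723', '17556'], ['127', '1290', '1145'],
          ['1263', '34900', '37834'], ['4415', '2073', '3899'],
          ['4415', '4580', '6004']]

seen = Counter()   # how many times each sequence name has occurred so far
matches = {}
for name, start, end in coords:
    # first occurrence keeps the plain name, later ones get a, b, c, ...
    matches[name + letter_suffix(seen[name])] = [start, end]
    seen[name] += 1

print(matches)
```

One caveat with this approach: if a genuine sequence name already ends in a letter (say a contig really called 4415a), a generated key could collide with it, which is another reason the list-valued dictionary is usually the safer choice.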
