
Create a Python dictionary from a tab-delimited file that is not 1:1

I want to create two Python 3 dictionaries from a tab-delimited file (no header). The file has two columns that I want to name group_id and gene_id. A group may have multiple genes, and a gene can belong to multiple groups. Here is a simple example of what I want.

group_id gene_id

A        a
A        b
A        c
A        d
B        a
B        c
B        e

I would like to have 2 dictionaries:

dict1 = {'A': ('a', 'b', 'c', 'd'), 'B': ('a', 'c', 'e')}

and

dict2 = {'a': ('A', 'B'), 'b': ('A',), 'c': ('A', 'B'), 'd': ('A',), 'e': ('B',)}

I would like to store the values in tuples for speed, because my file is 2.5 GB and I will end up with big dictionaries that I have to work with later.

I know there are a lot of questions like this but I can't find an answer from those as they deal with files that have key:value pairs.

Thanks!

I think the code mostly speaks for itself: keep two separate dicts and fill both of them as you parse each line. When a key appears for the first time you have to create a new entry for it, which is what the if statements do. One point: you should use lists while building, because tuples are immutable and can't be appended to after you create them:

from pprint import pprint

data = """group_id gene_id
    A        a
    A        b
    A        c
    A        d
    B        a
    B        c
    B        e"""

lines = data.splitlines()
group_dict = {}
gene_dict = {}

# Skip the header line, then record each pair in both directions.
for line in lines[1:]:
    group, gene = line.split()
    if group not in group_dict:
        group_dict[group] = []
    group_dict[group].append(gene)

    if gene not in gene_dict:
        gene_dict[gene] = []
    gene_dict[gene].append(group)

pprint(group_dict)
pprint(gene_dict)

prints:

{'A': ['a', 'b', 'c', 'd'], 'B': ['a', 'c', 'e']}
{'a': ['A', 'B'], 'b': ['A'], 'c': ['A', 'B'], 'd': ['A'], 'e': ['B']}
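Since the question asks for tuples, one option is to build with lists as above and convert once at the end. The following is only a minimal sketch of that idea: the `rows` sample and the `group_lists` / `gene_lists` names are made up for illustration, and whether tuples are noticeably faster or smaller for a 2.5 GB input is something you would have to measure.

```python
# Hypothetical sample pairs standing in for the parsed lines of the real file.
rows = [("A", "a"), ("A", "b"), ("A", "c"), ("A", "d"),
        ("B", "a"), ("B", "c"), ("B", "e")]

group_lists = {}
gene_lists = {}
for group, gene in rows:
    # setdefault creates the empty list the first time a key is seen.
    group_lists.setdefault(group, []).append(gene)
    gene_lists.setdefault(gene, []).append(group)

# Freeze every list into a tuple once all rows have been read.
group_dict = {key: tuple(values) for key, values in group_lists.items()}
gene_dict = {key: tuple(values) for key, values in gene_lists.items()}

print(group_dict)
print(gene_dict)
```

Note that a one-element tuple is written ('A',) with a trailing comma; ('A') is just the string 'A'.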

The collections module has a defaultdict class, a dict subclass that creates the value for a missing key on first access (here, an empty list). Just append the values to each key and you are pretty much done.

from collections import defaultdict

dict1 = defaultdict(list)
dict2 = defaultdict(list)

with open("C:/path/example.txt") as f:
    next(f)  # skip the header line; remove this if the file has no header
    for line in f:
        if line.strip():  # ignore blank lines
            a, b = line.split()
            dict1[a].append(b)
            dict2[b].append(a)

print(dict1)

prints

defaultdict(<class 'list'>, {'A': ['a', 'b', 'c', 'd'], 'B': ['a', 'c', 'e']})

and dict2

defaultdict(<class 'list'>, {'a': ['A', 'B'], 'b': ['A'], 'c': ['A', 'B'], 'd': ['A'], 'e': ['B']})
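The question describes a tab-delimited file with no header, so here is a rough sketch of the same defaultdict approach that splits strictly on tabs and does not skip a first line. The `build_dicts` function and the in-memory `sample` are hypothetical names used only for this illustration, and the sketch assumes the IDs contain no double-quote characters, which csv.reader would treat specially.

```python
import csv
import io
from collections import defaultdict


def build_dicts(lines):
    """Build group -> genes and gene -> groups mappings from tab-delimited lines."""
    group_to_genes = defaultdict(list)
    gene_to_groups = defaultdict(list)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2:
            continue  # skip blank or malformed lines
        group_id, gene_id = row
        group_to_genes[group_id].append(gene_id)
        gene_to_groups[gene_id].append(group_id)
    # Return plain dicts so later lookups of unknown keys raise KeyError
    # instead of silently creating empty entries.
    return dict(group_to_genes), dict(gene_to_groups)


# In-memory stand-in for the real file; for a file on disk, pass the
# object returned by open(path, newline="") instead.
sample = io.StringIO("A\ta\nA\tb\nA\tc\nA\td\nB\ta\nB\tc\nB\te\n")
group_to_genes, gene_to_groups = build_dicts(sample)
print(group_to_genes)
print(gene_to_groups)
```

Because the reader consumes the file one line at a time, only the two dictionaries are held in memory, not the raw file contents.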
