
Storing a large dictionary with non-unique key-value pairs

I have a large (over a million characters) text file of this shape:

'abc' 2
'nmb' 3
'sds' 5
'abc' 6

As you see, each line has two elements. The keys are not unique, meaning 'abc' could map to 2 and 3 and probably many more numbers. I need to store this in a suitable data structure that I can save to a file. Later I would like to see, for example, how many times a string has shown up and how many times it has been mapped to a certain number. I need to be able to do this relatively quickly, otherwise I could just use the file as-is.

I first tried to create a dictionary and store the data using the json library, which was pretty easy and straightforward. But then I realized I cannot use that, because the keys are not unique: a key can be mapped to several values, and the data structure should preserve that.

So, given the size of the file and the way I want to use it, what is a good way to do this?

How about a dict of lists?

{ 
    'abc': [2, 6],
    'nmb': [3],
    'sds': [5]
}

Edit after further understanding the OP's use case: you could also do this:

{
    'abc': {2: 3, 6: 7},
    'nmb': {3: 1},
    'sds': {5: 1},
}

You can also use defaultdict and collections.Counter, as mentioned in the other answers, to shortcut some of the work.

how many times a string has shown up and how many times it has been mapped to a certain number.

If that's the specific problem you're trying to solve, I would try a dict mapping the strings to collections.Counter instances. You can then trivially look up by string key and then by numeric key to get the count ( data['abc'][2] -> 1 ), or look up by string key and sum the values of the Counter to get the total number of occurrences ( sum(data['abc'].values()) -> 2 ).
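Here is a minimal sketch of that approach, assuming the input file is called input.txt and that the quotes are literally part of each key in the file (both are assumptions, not something stated in the question):

from collections import Counter, defaultdict

# map each string to a Counter of the numbers it was paired with
data = defaultdict(Counter)

with open("input.txt", "r") as f:       # "input.txt" is a placeholder name
    for line in f:
        key, value = line.split()
        key = key.strip("'")            # drop the quotes so lookups match data['abc']
        data[key][int(value)] += 1

print(data['abc'][2])                   # -> 1: 'abc' was mapped to 2 once
print(sum(data['abc'].values()))        # -> 2: 'abc' appeared twice in total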

You can use a defaultdict here:

from collections import defaultdict

# defaultdict(list) starts every missing key off as an empty list
data = defaultdict(list)
with open("input.txt", "r") as f:   # text mode, so keys and values are str, not bytes
    for line in f:
        key, value = line.split()   # note: split() keeps the quotes in the key, e.g. "'abc'"
        data[key].append(value)

The advantage of the defaultdict is that you don't need to initialize an empty list for every new key that you encounter.

Finding out how many times a key appeared is then a simple len(data[key]) call.
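For example, with the sample data and the code above (remember that split() keeps the surrounding quotes in the keys and stores the values as strings):

print(len(data["'abc'"]))           # -> 2: 'abc' appeared on two lines
print(data["'abc'"].count("2"))     # -> 1: 'abc' was mapped to 2 once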

When saving this back, either pickle it, or write a single line for each key with its values comma-separated, so that you can reconstruct it quickly later:

with open("output.txt", "wt") as f:
    for key in data:
        f.write("{} {}\n".format(key, ','.join(data[key])))
