I have a large (over a million characters) text file of this shape:

```
'abc' 2
'nmb' 3
'sds' 5
'abc' 6
```

As you can see, each line has two elements. The pairs are not unique: `'abc'`, for example, maps to both 2 and 6, and in the real file a key may map to many more values. I need to store this in a suitable data structure that I can save to a file. Later I want to look up, for example, how many times a string has shown up and how many times it has been mapped to a certain number. These lookups need to be reasonably fast; otherwise I could just use the file as-is.
I first tried creating a dictionary and storing the data with the json library, which was easy and straightforward to do. But then I realized I can't use that, because the keys are not unique: a key can map to several values, and the data structure needs to preserve that. So given the size of the file and the way I want to use it, what is a good way to do this?
How about a dict of lists?

```python
{
    'abc': [2, 6],
    'nmb': [3],
    'sds': [5]
}
```
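A dict of lists also brings back the json approach from the question, since it serializes directly. A minimal sketch (the file name is a placeholder):

```python
import json

data = {'abc': [2, 6], 'nmb': [3], 'sds': [5]}

# How many times did 'abc' appear, and how often did it map to 2?
occurrences = len(data['abc'])      # 2
mapped_to_2 = data['abc'].count(2)  # 1

# A dict of lists round-trips through json directly
with open("data.json", "wt") as f:
    json.dump(data, f)
with open("data.json", "rt") as f:
    restored = json.load(f)
```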
Edit, after further understanding the OP's use case: you could also map each string to a dict of counts, i.e. how many times each number appeared for that string. For the sample above, every pair occurs once:

```python
{
    'abc': {2: 1, 6: 1},
    'nmb': {3: 1},
    'sds': {5: 1},
}
```
You can also use `defaultdict` and `collections.Counter`, as mentioned by the other answers, to shortcut some of the work.
> how many times a string has shown up and how many times it has been mapped to a certain number.
If that's the specific problem you're trying to solve, I would use a dict mapping each string to a `collections.Counter` instance. You can then trivially look up by string key and then by numeric key to get the count for that pair (`data['abc'][2]` -> `1`), or look up by string key and sum the Counter's values to get the total number of occurrences (`sum(data['abc'].values())` -> `2`).
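A sketch of building that structure from the file's lines, shown here with an in-memory sample (note that `split()` leaves the surrounding quotes on the keys):

```python
from collections import Counter, defaultdict

lines = ["'abc' 2", "'nmb' 3", "'sds' 5", "'abc' 6"]

data = defaultdict(Counter)
for line in lines:
    key, value = line.split()
    data[key][int(value)] += 1

pair_count = data["'abc'"][2]        # how often 'abc' mapped to 2
total = sum(data["'abc'"].values())  # how often 'abc' appeared at all
```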
You can use a `defaultdict` here:
```python
from collections import defaultdict

data = defaultdict(list)
# Open in text mode so keys and values are str, not bytes
with open("input.txt", "rt") as f:
    for line in f:
        key, value = line.split()
        data[key].append(value)
```
The advantage of the defaultdict is that you don't need to initialize an empty list for every new key you encounter. Finding out how many times a key appeared is then just `len(data[key])`.
To save this back, either pickle it, or write a single line per key with comma-separated values, so that you can reconstruct it quickly later (the join works because the values were kept as strings):

```python
with open("output.txt", "wt") as f:
    for key in data:
        f.write("{} {}\n".format(key, ','.join(data[key])))
```
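If you go the pickle route, the round trip is one call each way (the file name is a placeholder; the values here are the strings the reading loop produces):

```python
import pickle

data = {"'abc'": ['2', '6'], "'nmb'": ['3'], "'sds'": ['5']}

with open("data.pkl", "wb") as f:
    pickle.dump(data, f)

with open("data.pkl", "rb") as f:
    restored = pickle.load(f)
```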