I intend to read a file, which is about 500MB in total, into a dict according to the key in each line. The code snippet is as follows:
f2 = open("ENST-NM-chr-name.txt", "r") # small amount
lines = [l.strip() for l in f2.readlines() if l.strip()]
sample = dict([(l.split("\t")[2].strip("\""), l) for l in lines]) ## convert [(1,2), (3,4)] to {1:2, 3:4}
When running on a machine with 4GB of memory, Python raises a MemoryError. If I change the expression assigned to the sample variable to [l for l in lines], it works fine.
At first, I thought the split method was consuming lots of memory, so I adjusted my code to this:
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start
...
sample = dict([(l[find_nth(l, "\t", 4):].strip(), l) for l in lines])
But it turns out the same.
A new discovery is that it runs normally without running out of memory provided I remove the dict() conversion, regardless of the code logic.
Could anyone give me some idea on this problem?
You're creating a list containing every line, which will continue to exist until lines goes out of scope, then creating another big list of entirely different strings based on it, then a dict from that list before any of the earlier objects can be freed. Just build the dict in one step.
with open("ENST-NM-chr-name.txt") as f:
    sample = {}
    for l in f:
        l = l.strip()
        if l:
            sample[l.split("\t")[2].strip('"')] = l
You can achieve about the same effect by using a generator expression instead of a list comprehension, but it feels nicer (to me) not to strip twice.
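For comparison, here is a rough sketch of the generator-expression variant that still strips each line only once. The build_index helper and the in-memory demo data are made up for illustration; in practice you would pass the open file object instead of the StringIO.

```python
import io

def build_index(lines):
    """Map the third tab-separated field (quotes removed) to the stripped line."""
    # Lazy: each line is stripped once, as it is consumed.
    stripped = (raw.strip() for raw in lines)
    # dict() pulls (key, value) pairs from the generator one at a time,
    # so no intermediate list of tuples is ever built.
    return dict(
        (line.split("\t")[2].strip('"'), line)
        for line in stripped
        if line  # skip blank lines
    )

# Hypothetical sample rows standing in for ENST-NM-chr-name.txt
demo = io.StringIO('a\tb\t"key1"\td\n\n  \ne\tf\t"key2"\th\n')
sample = build_index(demo)
```

Chaining two generators like this keeps only the current line and the growing dict alive, which is the same memory profile as the explicit loop.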
What if you turn your list into a generator, and your dict into a lovely dictionary comprehension:
f2 = open("ENST-NM-chr-name.txt", "r") # small amount
lines = (l.strip() for l in f2 if l.strip())
sample = {line.split('\t')[2].strip('\"'): line for line in lines}
Line 2 above was mistakenly written at first as lines = (l.strip() for l in f2.readlines() if l.strip()), which would still read the whole file into a list.
Do a generator and a dict comprehension perhaps (somehow) alleviate the memory requirements?
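One way to check this yourself is to measure peak allocation with the standard tracemalloc module. The sketch below is only an illustration: the synthetic rows, the row count, and the helper names (make_data, build_with_lists, build_streaming, peak_bytes) are all assumptions, and the absolute numbers will differ for a real 500MB file.

```python
import io
import tracemalloc

def make_data(n):
    # Hypothetical synthetic rows: four tab-separated fields, key in the third.
    return "".join(f'ENST{i}\tNM{i}\t"key{i}"\tchr1\n' for i in range(n))

def build_with_lists(f):
    # Mirrors the question: a list of lines, then a list of tuples, then a dict.
    lines = [l.strip() for l in f.readlines() if l.strip()]
    return dict([(l.split("\t")[2].strip('"'), l) for l in lines])

def build_streaming(f):
    # Generator plus dict comprehension: only the dict is materialized.
    lines = (l.strip() for l in f if l.strip())
    return {l.split("\t")[2].strip('"'): l for l in lines}

def peak_bytes(builder, text):
    f = io.StringIO(text)  # created before tracing starts, so it is not counted
    tracemalloc.start()
    result = builder(f)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak, result

text = make_data(20000)
peak_lists, d1 = peak_bytes(build_with_lists, text)
peak_stream, d2 = peak_bytes(build_streaming, text)
print("list-based peak:", peak_lists, "bytes")
print("streaming peak: ", peak_stream, "bytes")
```

Both builders produce the same dict. The streaming version should report a lower peak because the intermediate lists never exist, although the final dict itself still has to fit in memory either way.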