
Python reading whole file vs line by line - memory statistics

I am trying to read a file with JSON data (3.1M+ records), and to compare the memory and time efficiency of reading the whole file at once versus reading it line by line.

File1 is serialized JSON data: a single list containing the 3.1M+ dictionaries, 811M in size.

File2 is serialized JSON data with one dictionary per line. In total there are 3.1M+ lines, 480M in size.
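For reference, the two layouts could have been produced roughly like this (a hypothetical sketch; the question doesn't show the generation script, and the sample records are made up):

import json

# Stand-in records; the real data has 3.1M+ dictionaries.
records = [{"id": i, "value": "x" * 10} for i in xrange(3)]

# File1: a single JSON list holding every record.
with open("File1.json", "w") as f:
    json.dump(records, f)

# File2: one JSON object per line (JSON Lines style).
with open("File2.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")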

Profile info while reading File1

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.loads(f.read())
 9   3725.3 MiB      0.0 MiB     print len(f_json)


     23805 function calls (22916 primitive calls) in 30.230 seconds

Profile info while reading File2

(flask)chitturiLaptop:data kiran$ python -m cProfile read_line_by_line.json 
3108779
Filename: read_line_by_line.json

 Line #    Mem usage    Increment   Line Contents
 ================================================
 4      9.4 MiB      0.0 MiB   @profile
 5                             def read_file():
 6      9.4 MiB      0.0 MiB     data_json = []
 7      9.4 MiB      0.0 MiB     with open("File2.json") as f:
 8   3726.2 MiB   3716.8 MiB       for line in f:
 9   3726.2 MiB      0.0 MiB         data_json.append(json.loads(line))
10   3726.2 MiB      0.0 MiB     print len(data_json)


     28002875 function calls (28001986 primitive calls) in 244.282 seconds

According to this SO post, shouldn't iterating through File2 take less memory? Reading the whole file and loading it with json also took less time.

I am running Python 2.7.2 on Mac OS X 10.8.5.

EDIT

Profile info with json.load

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8   3725.3 MiB   3715.9 MiB     f_json  = json.load(f)
 9   3725.3 MiB      0.0 MiB     print len(f_json)
10   3725.3 MiB      0.0 MiB     f.close()


     23820 function calls (22931 primitive calls) in 27.266 seconds

EDIT2

Some statistics to support the answer.

(flask)chitturiLaptop:data kiran$ python -m cProfile read_wholefile.json 
3108779
Filename: read_wholefile.json

Line #    Mem usage    Increment   Line Contents
================================================
 5      9.4 MiB      0.0 MiB   @profile
 6                             def read_file():
 7      9.4 MiB      0.0 MiB     f = open("File1.json")
 8    819.9 MiB    810.6 MiB     serialized = f.read()
 9   4535.8 MiB   3715.9 MiB     deserialized  = json.loads(serialized)
10   4535.8 MiB      0.0 MiB     print len(deserialized)
11   4535.8 MiB      0.0 MiB     f.close()


     23856 function calls (22967 primitive calls) in 26.815 seconds

Your first test doesn't show the memory consumed by reading the whole file into a giant string, since the giant string is discarded before the source line finishes and the profiler isn't showing you memory consumption in the middle of a line. If you save the string to a variable:

serialized = f.read()
deserialized = json.loads(serialized)

you'll see the 811 MB memory consumption for the temporary string. The ~3725 MB you're seeing in both tests is mostly the deserialized data structure, which is the same in both tests.
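Note also that line-by-line reading can only reduce memory if each record is processed and then discarded, rather than appended to data_json, which ends up holding the entire data set anyway. A minimal sketch of that streaming pattern (process here is a hypothetical per-record handler):

import json

def process(record):
    pass  # stand-in for whatever per-record work is needed

count = 0
with open("File2.json") as f:
    for line in f:
        process(json.loads(line))  # the record becomes garbage right after this call
        count += 1
print count

The per-call overhead also explains the time gap in your profiles: the line-by-line test makes one json.loads call per record (28 million function calls, 244 seconds) versus a single parse of the whole file (roughly 24 thousand calls, 30 seconds).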

Finally, note that json.load(f) is a faster, more concise, and more memory-friendly way to load JSON data from a file than either json.loads(f.read()) or line-by-line iteration.
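In code, following the EDIT above, that is just:

import json

with open("File1.json") as f:
    f_json = json.load(f)  # parse straight from the file object
print len(f_json)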
