How can I determine the size of a batch based on the available memory?

I need to read an arbitrarily big file, parse it (which means keeping some data in memory while doing so), then write a new version of the file to the file system. Given the memory constraints, I need to read the file incrementally or in batches. However, the bigger the batches, the better (because the information used to parse each line of the file is contained in the other lines of the file).

Apparently, I can get information about memory usage with something like

import psutil
psutil.virtual_memory()  # returns a namedtuple: total, available, percent, used, free, ...

which also reports the available memory, as well as a usage percentage. See this answer https://stackoverflow.com/a/11615673/3924118 for more info.

I would like to determine the size of the batches based on the available memory, and on the memory used by and reserved for the current Python process.

Apparently, I can get the memory used by the current Python process with

import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss)  # resident set size (RSS), in bytes

See https://stackoverflow.com/a/21632554/3924118 for more info.

So, is there a way of having an adaptive batch size (or generator), based on the memory available to the current Python process and the total available system memory, so that I can read as many lines as the available memory allows at a time, then read the next batch of lines, and so on? In other words, I need to read the file incrementally, such that the number of lines read at once is maximized while satisfying the memory constraints (within a certain threshold, for example, until 90% of the memory is used).
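
A minimal sketch of what such a generator could look like, assuming psutil is installed; the function name adaptive_batches, the 90% threshold, and the every-1000-lines check interval are all illustrative assumptions, not a tested recipe:

import psutil

def adaptive_batches(path, memory_fraction=0.9, check_every=1000):
    # Illustrative sketch: grow the current batch until overall memory
    # usage crosses memory_fraction, checking only every check_every
    # lines because the psutil call is not free.
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line)
            if (len(batch) % check_every == 0
                    and psutil.virtual_memory().percent >= memory_fraction * 100):
                yield batch
                batch = []
        if batch:
            yield batch  # whatever remains at the end of the file

One caveat: CPython does not necessarily return freed memory to the OS right away, so virtual_memory().percent may stay high even after a yielded batch is released, which is one reason the fixed-block approach in the answer below is easier to reason about.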

I would fix the size of the data you read at a time rather than attempting to fill your memory opportunistically. Read your data in fixed blocks, as sketched below. Much easier to deal with.
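
A minimal sketch of that fixed-block approach (the batch_size of 100,000 lines and the "input.txt" path are arbitrary placeholders, to be tuned for your data and machine):

from itertools import islice

def fixed_batches(path, batch_size=100_000):
    # Yield successive batches of batch_size lines until the file is exhausted.
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch

for batch in fixed_batches("input.txt"):
    ...  # parse the batch, then write the results out

Here batch_size is chosen once, up front, rather than adjusted at run time, which keeps the memory footprint predictable.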
