How can I determine the size of a batch based on the available memory?

I need to read an arbitrarily big file, parse it (which means keeping some data in memory while doing so), then write a new version of the file to the file system. Given the memory constraints, I need to read the file incrementally or in batches. However, the bigger the batches, the better (because the information used to parse each line of the file is contained in the other lines of the file).

Apparently, I can get information about memory usage with something like

import psutil
psutil.virtual_memory()  # returns a namedtuple: total, available, percent, used, free, ...

which also reports the available memory, as well as a usage percentage. See this answer https://stackoverflow.com/a/11615673/3924118 for more info.

I would like to determine the size of the batches based on the available memory, and on the memory used by and reserved for the current Python process.

Apparently, I can get the memory used by the current Python process with

import os
import psutil
process = psutil.Process(os.getpid())
print(process.memory_info().rss)  # resident set size (RSS), in bytes

See https://stackoverflow.com/a/21632554/3924118 for more info.

So, is there a way of having an adaptive batch size (or generator), based on the memory available to the current Python process and the total available system memory, so that I can read as many lines as the available memory allows at a time, then read the next batch of lines, and so on? In other words, I need to read the file incrementally, such that the number of lines read at once is maximized while satisfying the memory constraints (within a certain threshold, for example, until 90% of the memory is used).
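
A minimal sketch of what such a generator could look like, assuming psutil is installed; the function name adaptive_batches, the 90% threshold, and the every-1000-lines check interval are all illustrative assumptions, not a tested recipe:

import psutil

def adaptive_batches(path, memory_fraction=0.9, check_every=1000):
    # Illustrative sketch: grow the current batch until overall memory
    # usage crosses memory_fraction, checking only every check_every
    # lines because the psutil call is not free.
    with open(path) as f:
        batch = []
        for line in f:
            batch.append(line)
            if (len(batch) % check_every == 0
                    and psutil.virtual_memory().percent >= memory_fraction * 100):
                yield batch
                batch = []
        if batch:
            yield batch  # whatever remains at the end of the file

One caveat: CPython does not necessarily return freed memory to the OS right away, so virtual_memory().percent may stay high even after a yielded batch is released, which is one reason the fixed-block approach in the answer below is easier to reason about.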

I would fix the size of the data you read at a time rather than attempting to fill your memory opportunistically. Read your data in fixed blocks, as sketched below. Much easier to deal with.
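
A minimal sketch of that fixed-block approach (the batch_size of 100,000 lines and the "input.txt" path are arbitrary placeholders, to be tuned for your data and machine):

from itertools import islice

def fixed_batches(path, batch_size=100_000):
    # Yield successive batches of batch_size lines until the file is exhausted.
    with open(path) as f:
        while True:
            batch = list(islice(f, batch_size))
            if not batch:
                break
            yield batch

for batch in fixed_batches("input.txt"):
    ...  # parse the batch, then write the results out

Here batch_size is chosen once, up front, rather than adjusted at run time, which keeps the memory footprint predictable.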
