Script suddenly using all RAM
I have a Python script I am using to convert some very densely formatted csv files into another format that I need. The csv files are quite large (3GB), so I read them in chunks to avoid using all the RAM (I have 32GB of RAM on the machine I am using).
The odd thing is that the script processes the first file using only a few GB of memory (about 3GB, based on what top says).
I finish that file and load the next file, again in chunks. Suddenly I am using 25GB, writing to swap, and the process is killed. I'm not sure what changes between the first and second iteration.
I have put in a time.sleep(60) to try to let the garbage collector catch up, but the process still goes from ~10% memory usage to ~85% and then gets killed.
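For what it's worth, sleeping does not force CPython to reclaim anything: objects are freed as soon as their reference count drops to zero, and the cyclic collector runs on allocation thresholds, not on a timer. A minimal sketch of requesting a collection explicitly, which only helps if the old data is actually unreachable:

import gc

# Ask the cyclic garbage collector to run between files. This frees
# memory only if nothing still references the old chunks; it cannot
# help if something (like a dict) keeps growing.
gc.collect()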
Here's the main chunk of the script:
import pandas as pd
from time import sleep

# data_dict maps each sensor name to a list of [name, date, time, value]
# rows; csv_dict maps each sensor name to its output file path.
# Both are built earlier in the script, as is the list of input files.
for file in files:
    sleep(60)
    print(file)
    read_names = True
    count = 0
    for df in pd.read_csv(file, encoding='unicode_escape', chunksize=1e4, names=['all']):
        start_index = 0
        count += 1
        if read_names:
            # First chunk: recover the sensor names from the header row.
            names = df.iloc[0, :].apply(lambda x: x.split(';')).values[0]
            names = names[1:]
            start_index = 2
            read_names = False
        for row in df.iloc[start_index:, :].iterrows():
            data = row[1]
            data_list = data['all'].split(';')
            date_time = data_list[0]
            values = data_list[1:]
            date, time = date_time.split(' ')
            dd, mm, yyyy = date.split('/')
            date = yyyy + '/' + mm + '/' + dd
            for name, value in zip(names, values):
                try:
                    data_dict[name].append([name, date, time, float(value)])
                except:
                    pass
        # Every 5 chunks, resample each sensor to 1-minute means and
        # append the result to that sensor's output file.
        if count % 5 == 0:
            for name in names:
                start_date = data_dict[name][0][1]
                start_time = data_dict[name][0][2]
                end_date = data_dict[name][-1][1]
                end_time = data_dict[name][-1][2]
                start_dt = start_date + ' ' + start_time
                end_dt = end_date + ' ' + end_time
                dt_index = pd.date_range(start=start_dt, freq='1S', periods=len(data_dict[name]))
                df = pd.DataFrame(data_dict[name], index=dt_index)
                df = df[3].resample('1T').mean().round(10)
                with open(csv_dict[name], 'a') as ff:
                    for index, value in zip(df.index, df.values):
                        date, time = str(index).split(' ')
                        to_write = f"{name}, {date}, {time}, {value}\n"
                        ff.write(to_write)
Is there something I can do to manage this better? I need to loop over 50 large files for this task.
Data format: Input
time sensor1 sensor2 sensor3 sensor....
2022-07-01 00:00:00; 2.559;.234;0;0;0;.....
2022-07-01 00:00:01; 2.560;.331;0;0;0;.....
2022-07-01 00:00:02; 2.558;.258;0;0;0;.....
Output
sensor1, 2019-05-13, 05:58:00, 2.559
sensor1, 2019-05-13, 05:59:00, 2.560
sensor1, 2019-05-13, 06:00:00, 2.558
Edit: interesting finding - the files I am writing to suddenly stop being updated; they are several minutes behind where they should be if writing were happening as expected, and the data at the tail of each file is not changing either. Thus I assume the data is building up in the dictionary and swamping the RAM, which makes sense. Now to understand why the writing isn't happening.
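That guess is consistent with the code above: data_dict[name] is appended to on every row but never emptied after a flush, so each list keeps growing across chunks and across files. A minimal sketch of the fix this points to, assuming the flush block stays as posted:

if count % 5 == 0:
    for name in names:
        # ... resample and write out data_dict[name] as before ...
        # then drop the rows that have already been flushed,
        # so the list stops growing:
        data_dict[name] = []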
Edit 2: more interesting finds!! The script runs fine through the first csv and a big chunk of the second csv before filling up the RAM and crashing. The RAM problem seems to start with the second file, so I skipped processing that one, and magically the script has now run longer than it ever has without a memory issue. Perhaps that file contains corrupt data that throws something off.
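Worth noting while chasing that theory: the bare except: pass in the inner loop silently discards every value that fails float(), so a corrupt file never announces itself. A small, hypothetical variant that counts the failures instead of hiding them:

# Hypothetical tweak: count parse failures per file instead of hiding them.
bad_values = 0  # reset at the top of the per-file loop
for name, value in zip(names, values):
    try:
        data_dict[name].append([name, date, time, float(value)])
    except ValueError:
        bad_values += 1
# after the chunk loop for this file:
print(f"{file}: {bad_values} unparseable values")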
Answer:

Given file.csv that looks exactly like:
time sensor1 sensor2 sensor3 sensor4 sensor5
2022-07-01 00:00:00; 2.559;.234;0;0;0
2022-07-01 00:00:01; 2.560;.331;0;0;0
2022-07-01 00:00:02; 2.558;.258;0;0;0
You're doing a lot more than this, and not using proper pandas methods will kill you on speed (iterrows is basically never the best option). Basically, if you're manually looping over a DataFrame, you're probably doing it wrong. But if you follow this pattern of using the reader as a context manager, instead of trying to treat it as an iterator (a deprecated approach), you won't have the memory issues.
import pandas as pd

files = ['file.csv']

for file in files:
    with open(file) as f:
        # Grab the column names from the header line:
        cols = f.readline().split()
        # Initialize the reader as a context manager.
        # You'll want a larger chunksize; 1e5 should even work.
        with pd.read_csv(f, names=cols, sep=';', chunksize=1) as chunks:
            for df in chunks:
                # Split the "date time" stamp into separate columns:
                df[['date', 'time']] = df.time.str.split(expand=True)
                # Wide -> long: one row per (sensor, timestamp) pair:
                df = df.melt(['date', 'time'], var_name='sensor')
                df = df[['sensor', 'date', 'time', 'value']]
                df.to_csv(f'new_{file}', mode='a', index=False, header=False)
Output of new_file.csv:
sensor1,2022-07-01,00:00:00,2.559
sensor2,2022-07-01,00:00:00,0.234
sensor3,2022-07-01,00:00:00,0.0
sensor4,2022-07-01,00:00:00,0.0
sensor5,2022-07-01,00:00:00,0.0
sensor1,2022-07-01,00:00:01,2.56
sensor2,2022-07-01,00:00:01,0.331
sensor3,2022-07-01,00:00:01,0.0
sensor4,2022-07-01,00:00:01,0.0
sensor5,2022-07-01,00:00:01,0.0
sensor1,2022-07-01,00:00:02,2.558
sensor2,2022-07-01,00:00:02,0.258
sensor3,2022-07-01,00:00:02,0.0
sensor4,2022-07-01,00:00:02,0.0
sensor5,2022-07-01,00:00:02,0.0
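One difference from the original script: it resampled each sensor to 1-minute means before writing, whereas the loop above writes every raw second. If that step is still needed, here is a rough sketch of doing it per chunk on the melted frame, replacing the df.to_csv line (hypothetical, and only exact when chunk boundaries fall on whole minutes; a partial boundary minute would otherwise need to be carried into the next chunk):

# Inside the chunk loop, after the melt and column reorder:
df['ts'] = pd.to_datetime(df['date'] + ' ' + df['time'])
means = (df.set_index('ts')
           .groupby('sensor')['value']
           .resample('1T').mean().round(10)
           .reset_index())
means['date'] = means['ts'].dt.strftime('%Y-%m-%d')
means['time'] = means['ts'].dt.strftime('%H:%M:%S')
means[['sensor', 'date', 'time', 'value']].to_csv(
    f'new_{file}', mode='a', index=False, header=False)

A chunksize of 1 obviously defeats this; something like 1e5 lets each chunk span many minutes.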