
Script suddenly using all RAM

I have a python script I am using to convert some very densely formatted csv files into another format that I need. The csv files are quite large (3GB each), so I read them in chunks to avoid using all the RAM (I have 32GB of RAM on the machine I am using).

The odd thing is that the script processes one file using only a few GB of memory (about ~3GB based on what top says).

I finish that file and load the next file, again in chunks. Suddenly I am using 25GB, writing to swap, and the process is killed. I'm not sure what is changing between the first and second iteration. I have put in a sleep(60) to try to let the garbage collector catch up, but it still goes from ~10% memory to ~85% to a killed process.
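
To see what is actually growing between the first and second file, a minimal diagnostic sketch with Python's built-in tracemalloc module (not part of the original script, just an illustration) looks like:

import tracemalloc

tracemalloc.start()
# ... process the first file ...
snap1 = tracemalloc.take_snapshot()
# ... process the second file (or part of it) ...
snap2 = tracemalloc.take_snapshot()

# Print the ten call sites whose allocations grew the most between snapshots
for stat in snap2.compare_to(snap1, 'lineno')[:10]:
    print(stat)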

Here's the main chunk of the script:

from time import sleep
import pandas as pd

# data_dict and csv_dict are assumed to be defined earlier in the script:
# data_dict maps each sensor name to a list of accumulated rows,
# csv_dict maps each sensor name to its output file path.
for file in files:
    sleep(60)
    print(file)
    read_names = True
    count = 0
    for df in pd.read_csv(file, encoding= 'unicode_escape', chunksize=1e4, names=['all']):
        start_index = 0
        count += 1
        if read_names:
            names = df.iloc[0,:].apply(lambda x: x.split(';')).values[0]
            names = names[1:]
            start_index = 2
            read_names = False
        for row in df.iloc[start_index:,:].iterrows():
            data = row[1]
            data_list = data['all'].split(';')
            date_time = data_list[0]
            values = data_list[1:]
            date, time = date_time.split(' ')
            dd, mm, yyyy = date.split('/')
            date = yyyy + '/' + mm + '/' + dd
            for name, value in zip(names, values):
                try:
                    data_dict[name].append([name, date, time, float(value)])
                except:
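                    # the bare except silently drops sensors that are not in
                    # data_dict and values that fail float() conversion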
                    pass
        if count % 5 == 0:
            for name in names:
                start_date = data_dict[name][0][1]
                start_time = data_dict[name][0][2]
                end_date = data_dict[name][-1][1]
                end_time = data_dict[name][-1][2]
                start_dt = start_date + ' ' + start_time
                end_dt = end_date + ' ' + end_time
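                # note: end_dt is built but never used; the index below
                # assumes exactly one sample per second with no gaps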
                dt_index = pd.date_range(start=start_dt, freq='1S', periods=len(data_dict[name]))
                df = pd.DataFrame(data_dict[name], index=dt_index)
                df = df[3].resample('1T').mean().round(10)
                with open(csv_dict[name], 'a') as ff:
                    for index, value in zip(df.index, df.values):
                        date, time = str(index).split(' ')
                        to_write = f"{name}, {date}, {time}, {value}\n"
                        ff.write(to_write)

Is there something I can do to manage this better? I need to loop over 50 large files for this task.

Data format: Input

time sensor1 sensor2 sensor3 sensor....
2022-07-01 00:00:00; 2.559;.234;0;0;0;.....
2022-07-01 00:00:01; 2.560;.331;0;0;0;.....
2022-07-01 00:00:02; 2.558;.258;0;0;0;.....

Output

sensor1, 2019-05-13, 05:58:00, 2.559 
sensor1, 2019-05-13, 05:59:00, 2.560 
sensor1, 2019-05-13, 06:00:00, 2.558 

Edit: interesting finding - the output files I am writing to are suddenly not being updated; they are several minutes behind where they should be if writing were happening as expected. The data within the file is not changing either when I check the tail of the file. So I assume the data is building up in the dictionary and swamping RAM, which makes sense. Now to understand why the writing isn't happening.
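
If the dictionary is the problem, a minimal sketch of a fix (untested against the original data) is to clear each sensor's list as soon as its rows have been flushed to disk, so the dictionary never holds more than one flush interval of data:

if count % 5 == 0:
    for name in names:
        rows = data_dict[name]
        if not rows:
            continue
        start_dt = rows[0][1] + ' ' + rows[0][2]
        dt_index = pd.date_range(start=start_dt, freq='1S', periods=len(rows))
        resampled = pd.DataFrame(rows, index=dt_index)[3].resample('1T').mean().round(10)
        with open(csv_dict[name], 'a') as ff:
            for index, value in zip(resampled.index, resampled.values):
                date, time = str(index).split(' ')
                ff.write(f"{name}, {date}, {time}, {value}\n")
        # The crucial line: drop the rows that are already on disk so the
        # dictionary does not grow without bound across files.
        data_dict[name] = []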

Edit 2: more interesting finds!! The script runs fine on the first csv and a big chunk of the second csv before filling up the RAM and crashing. The RAM problem seems to start with the second file, so I skipped processing that one, and magically the script has now run longer than it ever has without a memory issue. This is perhaps corrupt data that throws something off.
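
If a corrupt file is the trigger, a defensive sketch (assuming pandas >= 1.3 for on_bad_lines; the field-count guard is an assumption about what "corrupt" means here) is to skip lines pandas cannot parse and drop rows whose width does not match the header:

for df in pd.read_csv(file, encoding='unicode_escape', chunksize=10_000,
                      names=['all'], on_bad_lines='skip'):
    for row in df.iterrows():
        fields = row[1]['all'].split(';')
        # Drop rows whose field count does not match the header;
        # a malformed row would otherwise mis-pair names and values.
        if len(fields) != len(names) + 1:
            continue
        # ... parse date_time and values as before ...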

Given file.csv that looks exactly like:

time sensor1 sensor2 sensor3 sensor4 sensor5
2022-07-01 00:00:00; 2.559;.234;0;0;0
2022-07-01 00:00:01; 2.560;.331;0;0;0
2022-07-01 00:00:02; 2.558;.258;0;0;0

You're doing a lot more than this, and not using proper pandas methods will cost you badly on time (iterrows is basically never the best option). Basically, if you're manually looping over a DataFrame, you're probably doing it wrong.
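
For instance, the per-row split in the question's loop collapses to one vectorized call (a sketch against the question's single-column 'all' layout, where names is the header list parsed earlier):

# One vectorized call replaces the whole Python-level row loop:
parts = df['all'].str.split(';', expand=True)
parts.columns = ['date_time'] + list(names)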

But if you follow this pattern of using the chunked reader as a context manager, instead of trying to treat it as a bare iterator, you won't have the memory issues.

files = ['file.csv']
for file in files:
    with open(file) as f:
        # Grab the columns:
        cols = f.readline().split()
        # Initialize the context-manager: 
        # You'll want a larger chunksize, 1e5 should even work.
        with pd.read_csv(f, names=cols, sep=';', chunksize=1) as chunks:
            for df in chunks:
                df[['date', 'time']] = df.time.str.split(expand=True)
                df = df.melt(['date', 'time'], var_name='sensor')
                df = df[['sensor', 'date', 'time', 'value']]
                df.to_csv(f'new_{file}', mode='a', index=False, header=False)

Output of new_file.csv:

sensor1,2022-07-01,00:00:00,2.559
sensor2,2022-07-01,00:00:00,0.234
sensor3,2022-07-01,00:00:00,0.0
sensor4,2022-07-01,00:00:00,0.0
sensor5,2022-07-01,00:00:00,0.0
sensor1,2022-07-01,00:00:01,2.56
sensor2,2022-07-01,00:00:01,0.331
sensor3,2022-07-01,00:00:01,0.0
sensor4,2022-07-01,00:00:01,0.0
sensor5,2022-07-01,00:00:01,0.0
sensor1,2022-07-01,00:00:02,2.558
sensor2,2022-07-01,00:00:02,0.258
sensor3,2022-07-01,00:00:02,0.0
sensor4,2022-07-01,00:00:02,0.0
sensor5,2022-07-01,00:00:02,0.0
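
If you also need the one-minute means the original script produced, the same pattern extends naturally. A sketch of the chunk-loop body with resampling added (minutes that straddle a chunk boundary would need a second pass or a larger chunksize):

df['time'] = pd.to_datetime(df['time'])
# Aggregate each sensor column to per-minute means within the chunk
df = df.set_index('time').resample('1T').mean().round(10).reset_index()
df[['date', 'time']] = df['time'].astype(str).str.split(expand=True)
df = df.melt(['date', 'time'], var_name='sensor')
df = df[['sensor', 'date', 'time', 'value']]
df.to_csv(f'new_{file}', mode='a', index=False, header=False)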
