MemoryError when running a Python script on Google Cloud

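I am trying to use Google Cloud to run a script that makes a prediction for each row of the test.csv file. I am using the cloud because Google Colab seemed to take a long time. However, when I run it, I get a memory error: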

(pre_env) mikempc3@instance-1:~$ python predictSales.py 
Traceback (most recent call last):
  File "predictSales.py", line 7, in <module>
    sales = pd.read_csv("sales_train.csv")
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 463, in _read
    data = parser.read(nrows)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/io/parsers.py", line 1169, in read
    df = DataFrame(col_dict, columns=columns, index=index)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/frame.py", line 411, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 257, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/construction.py", line 87, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1694, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1764, in form_blocks
    int_blocks = _multi_blockify(items_dict["IntBlock"])
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1846, in _multi_blockify
    values, placement = _stack_arrays(list(tup_block), dtype)
  File "/home/mikempc3/pre_env/lib/python3.5/site-packages/pandas/core/internals/managers.py", line 1874, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError: Unable to allocate 67.2 MiB for an array with shape (3, 2935849) and data type int64

Here is my script:

import statsmodels  # needed for the statsmodels.__version__ print below
import statsmodels.tsa.arima.model as smt
import pandas as pd
import datetime
import numpy as np


sales = pd.read_csv("sales_train.csv")
test = pd.read_csv("test.csv")

sales.date = sales.date.apply(lambda x: datetime.datetime.strptime(x, "%d.%m.%Y"))

sales_monthly = sales.groupby(
    ["date_block_num", "shop_id", "item_id"])[["date", "item_price",
                                               "item_cnt_day"]].agg({
    "date": ["min", "max"],
    "item_price": "mean",
    "item_cnt_day": "sum"})

array = []

for i, row in test.iterrows():
    print("row['shop_id']: ", row['shop_id'], " row['item_id']: ", row['item_id'])
    print(statsmodels.__version__)
    ts = pd.DataFrame(sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :]['item_price'].values *
                      sales_monthly.loc[pd.IndexSlice[:, [row['shop_id']], [row['item_id']]], :][
                          'item_cnt_day'].values).T.iloc[0]
    print(ts.values)
    if len(ts.values) > 2:
        best_aic = np.inf
        best_difference = None
        best_model = None

        ranges = range(1, 5)
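        for difference in ranges:
            # NOTE: `difference` is never passed to `order`, so every pass
            # refits the same ARIMA(0, 1, 0) model.
            # try:
            tmp_model = smt.ARIMA(ts.values, order=(0, 1, 0), trend='t').fit()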
            tmp_aic = tmp_model.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_difference = difference
                best_model = tmp_model
                # except Exception as e:
                #     print(e)
                #     continue
        if best_model is not None:
            y_hat = best_model.forecast()[0]
            if y_hat < 0:
                y_hat = 0
        else:
            y_hat = 0
    else:
        y_hat = 0
    print("predicted:", y_hat)
    d = {'id': row['ID'], 'item_cnt_month': y_hat}
    array.append(d)
    print("-------------------")

df = pd.DataFrame(array)
df.to_csv("submission.csv")

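You can use the Fil memory profiler ( https://pythonspeed.com/fil ) to find out which lines of code are responsible for peak memory usage. It also handles out-of-memory conditions and dumps a report when you run out.

The only caveats are that (1) it requires Python 3.6 or later, and (2) it only runs on Linux or macOS. We are already at 3.9, so it is probably time to upgrade anyway.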
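
Since the traceback in the question shows Python 3.5, where Fil will not run, a rough stand-in is the standard library's tracemalloc module (available since Python 3.4). This is not Fil, and it only sees allocations reported through Python's memory hooks, so treat the following as a sketch rather than a replacement. The helper names (peak_memory_mib, int_block_mib, allocate_block) are made up for illustration; allocate_block is a placeholder for whatever step you want to measure, such as the pd.read_csv call. The second helper only redoes the arithmetic behind the error message: 3 columns x 2,935,849 rows x 8 bytes per int64 value.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20


def int_block_mib(n_columns, n_rows, bytes_per_value=8):
    """Size in MiB of a dense integer block (8 bytes per value for int64)."""
    return n_columns * n_rows * bytes_per_value / 2**20


def allocate_block(n_bytes):
    """Hypothetical placeholder for the step being measured."""
    return bytearray(n_bytes)


if __name__ == "__main__":
    # Shape taken from the MemoryError message: (3, 2935849), int64
    print("int64 block: %.1f MiB" % int_block_mib(3, 2935849))  # about 67.2

    # Measure the peak of an arbitrary step; here a 10 MiB buffer
    _, peak = peak_memory_mib(allocate_block, 10 * 2**20)
    print("peak while allocating: %.1f MiB" % peak)
```

To measure the real script, you could wrap the loading step instead, for example peak_memory_mib(pd.read_csv, "sales_train.csv"). If the three integer columns happened to fit in a 4-byte type, the same block would need about 33.6 MiB instead of 67.2 MiB.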

