Airflow Python script exits with "Task exited with return code -9": how to solve it?

Question

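I don't know what this error means. Some people say it is a memory error, but I'm not sure, because the error message isn't explicit. The table I'm loading is large, though: 1 million rows.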

This is the part of my script where the error occurs:

  # snapshot_profiles
  df_snapshot_profiles = load_table('snapshot_profiles', conn)

  def return_key(x, key):
    try:
      return (x[key])
    except:
      return (None)

  df_snapshot_profiles['is_manager'] = df_snapshot_profiles["payload"].apply(
      lambda x: return_key(x, 'is_manager'))
  df_snapshot_profiles_actual = df_snapshot_profiles.loc[:,
                                                         ['profile_id', 'date']]
  df_snapshot_profiles_actual.sort_values(['profile_id', 'date'], inplace=True)
  df_snapshot_profiles_actual = df_snapshot_profiles_actual.groupby(
      'profile_id').max().reset_index()
  df_snapshot_profiles.drop(
      ['id', 'payload', 'version', 'company_id', 'inserted_at', 'updated_at'],
      axis=1,
      inplace=True)
  df_snapshot_profiles_actual = df_snapshot_profiles_actual.merge(
      df_snapshot_profiles, on=['date', 'profile_id'], how='left')
  df_snapshot_profiles_actual.drop('date', axis=1, inplace=True)

  df = df.merge(df_snapshot_profiles_actual, on='profile_id', how='left')
  del df_snapshot_profiles

  # Exclude companies with fewer than two users from the dataset (test companies)
  df_companies = df.groupby('company_name').count()
  df_companies.reset_index(inplace=True)
  df_companies = df_companies[df_companies['user_id'] > 2]
  df_companies.sort_values('user_id', ascending=False)

  companies = list(df_companies.company_name)

  df['check_company'] = df['company_name'].apply(lambda x: 'T'
                                                 if x in companies else 'F')
  df = df[df['check_company'] == 'T']
  df.drop('check_company', axis=1, inplace=True)

This is the script that loads a table and prints memory usage:

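import os

import pandas as pd
import psutil


def usage():
  # Resident set size (RSS) of the current process, in MiB
  process = psutil.Process(os.getpid())
  return process.memory_info()[0] / float(2**20)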


def load_table(table, conn):
    print_x(f'{usage()} Mb')
    print_x(f'loading table {table}')
    cursor = conn.cursor()
    cursor.execute(f'''select * from {ORIGIN_SCHEMA}.{table};''')
    df = cursor.fetchall()
    cursor.execute(f'''
        select column_name from information_schema.columns where table_name = '{table}';
    ''')
    labels = cursor.fetchall()
    label_list = []
    for label in labels:
        label_list.append(label[0])
    df = pd.DataFrame.from_records(df, columns=label_list)
    return (df)

Is there a way to avoid the error, by reducing memory usage or otherwise?

Answer 1

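Yes, this is most likely an out-of-memory problem. A return code of -9 means the process was killed with SIGKILL (signal 9), which is what the Linux out-of-memory killer sends. You can either add memory or move part of the work out of core (load and process the data in batches).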
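To illustrate where a -9 typically comes from, here is a minimal sketch, assuming a POSIX system. It does not involve Airflow at all: it only shows that a child process killed with SIGKILL reports return code -9 to its parent. The child command is a made-up placeholder.

```python
import signal
import subprocess
import sys

# Start a child process that would otherwise just sleep for a minute.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Kill it the way the kernel's OOM killer would: with SIGKILL (signal 9).
child.send_signal(signal.SIGKILL)
child.wait()

# On POSIX, subprocess reports "killed by signal N" as return code -N.
return_code = child.returncode
print(return_code)
```

On Linux, the kernel log (for example via `dmesg`) usually records which process the OOM killer chose, which is one way to confirm the diagnosis.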

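If you have the budget, add memory. 1 million rows × a worst-case string length of about 1,000 characters per row = 1M × 1K ≈ 1 GB of memory just to load the data. Merging or transforming DataFrames needs extra memory on top of that, so 16 GB should be enough.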
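The estimate above can be checked on a sample rather than guessed. The following is a rough sketch, assuming a made-up table with one 1,000-character string column; `df_demo`, the column names, and the row counts are illustrative and are not the asker's data.

```python
import pandas as pd

# Hypothetical sample: 10,000 rows, each carrying a distinct 1,000-character string.
n_rows = 10_000
df_demo = pd.DataFrame({
    "profile_id": range(n_rows),
    "payload": [f"{i:01000d}" for i in range(n_rows)],
})

# deep=True measures the string contents, not just the pointers to them.
bytes_per_row = df_demo.memory_usage(deep=True).sum() / n_rows

# Extrapolate from the sample to a table of 1 million rows.
estimated_gib = bytes_per_row * 1_000_000 / 2**30
print(f"{bytes_per_row:.0f} bytes per row, about {estimated_gib:.2f} GiB for 1M rows")
```

This only measures the finished DataFrame. While loading, the list of tuples returned by `fetchall()` and the DataFrame built from it exist at the same time, so peak usage is higher than this figure.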
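If you are comfortable with it, try working out of core, which means working from disk instead of memory.
- dask is one of the out-of-core counterparts to pandas. It processes the data in batches, which is slower but still works.
- Use the database to do some of the feature work. I have found that most databases can do the same kind of work as pandas, although it may take complex SQL.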
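As a sketch of the second option, the "latest snapshot per profile" step from the question (sort, group by `profile_id`, take the max `date`, merge back) can be expressed as one SQL query, so that only the reduced result reaches pandas. The example below uses an in-memory SQLite database as a stand-in so that it is self-contained. The table contents are made up, and `is_manager` is assumed to be a plain column here, whereas in the question it is extracted from `payload`.

```python
import sqlite3

import pandas as pd

# Stand-in database with a tiny, made-up snapshot_profiles table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table snapshot_profiles (profile_id integer, date text, is_manager integer)"
)
conn_demo.executemany(
    "insert into snapshot_profiles values (?, ?, ?)",
    [(1, "2019-01-01", 0), (1, "2019-02-01", 1), (2, "2019-01-15", 0)],
)

# Let the database pick the most recent row per profile_id.
query = """
    select s.profile_id, s.is_manager
    from snapshot_profiles s
    join (
        select profile_id, max(date) as date
        from snapshot_profiles
        group by profile_id
    ) latest
      on latest.profile_id = s.profile_id and latest.date = s.date
    order by s.profile_id
"""
df_latest = pd.read_sql_query(query, conn_demo)
print(df_latest)
```

Selecting only the needed columns, instead of `select *`, also keeps wide columns such as `payload` out of Python memory when they are not used.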

Good luck. If you like my answer, please upvote it.

Answer 2

I solved the problem by using a server-side cursor and fetching the data in chunks, like this:

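  # A named cursor keeps the result set on the server instead of
  # sending every row to the client at once
  serverCursor = conn.cursor("serverCursor")
  serverCursor.execute(f'''select * from {ORIGIN_SCHEMA}.{table};''')

  df = []
  while True:
    records = serverCursor.fetchmany(size=50000)
    if not records:
      break
    # extend() appends in place; df = df + records copies the whole list each time
    df.extend(records)
  serverCursor.close()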
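The loop above still gathers every row into one list. If memory remains tight, one possible refinement is to reduce each chunk as it arrives and keep only the running result. The following is a sketch under stated assumptions, not the method used in this answer: `iter_chunks` is a hypothetical helper, and an in-memory SQLite table with made-up rows stands in for the real database. With psycopg2 the cursor would be a named (server-side) one, as above.

```python
import sqlite3


def iter_chunks(cursor, chunk_size=50000):
    """Yield lists of rows from a DB-API cursor, chunk_size rows at a time."""
    while True:
        records = cursor.fetchmany(chunk_size)
        if not records:
            break
        yield records


# Stand-in database with ten made-up rows spread over three companies.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("create table t (user_id integer, company_name text)")
conn_demo.executemany(
    "insert into t values (?, ?)",
    [(i, f"company_{i % 3}") for i in range(10)],
)

cursor = conn_demo.cursor()
cursor.execute("select user_id, company_name from t")

# Count users per company one chunk at a time; only the counts stay in memory.
users_per_company = {}
for records in iter_chunks(cursor, chunk_size=4):
    for _user_id, company_name in records:
        users_per_company[company_name] = users_per_company.get(company_name, 0) + 1
cursor.close()

print(users_per_company)
```

The same pattern applies to any step that can be computed incrementally, such as counts, sums, or per-key maxima.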


Answer 1
5 2019-08-13 16:38:23

Answer 2
2 accepted 2019-08-21 14:06:42
