Airflow Python script exiting with Task exited with return code -9, how to solve that?

I don't know what this error means. Some people say it is a memory error, but I'm not sure, because the message is not explicit. The table I load is large, though: about 1 million rows.

Here is the part of my script where the error happens:

# snapshot_profiles
  df_snapshot_profiles = load_table('snapshot_profiles', conn)

  def return_key(x, key):
    # payload values may be None or missing the key; fall back to None
    try:
      return x[key]
    except (KeyError, TypeError):
      return None

  df_snapshot_profiles['is_manager'] = df_snapshot_profiles["payload"].apply(
      lambda x: return_key(x, 'is_manager'))
  df_snapshot_profiles_actual = df_snapshot_profiles.loc[:,
                                                         ['profile_id', 'date']]
  df_snapshot_profiles_actual.sort_values(['profile_id', 'date'], inplace=True)
  df_snapshot_profiles_actual = df_snapshot_profiles_actual.groupby(
      'profile_id').max().reset_index()
  df_snapshot_profiles.drop(
      ['id', 'payload', 'version', 'company_id', 'inserted_at', 'updated_at'],
      axis=1,
      inplace=True)
  df_snapshot_profiles_actual = df_snapshot_profiles_actual.merge(
      df_snapshot_profiles, on=['date', 'profile_id'], how='left')
  df_snapshot_profiles_actual.drop('date', axis=1, inplace=True)

  df = df.merge(df_snapshot_profiles_actual, on='profile_id', how='left')
  del df_snapshot_profiles

  # Exclude companies with fewer than two users (test companies)
  df_companies = df.groupby('company_name').count()
  df_companies.reset_index(inplace=True)
  df_companies = df_companies[df_companies['user_id'] > 2]
  df_companies = df_companies.sort_values('user_id', ascending=False)

  companies = list(df_companies.company_name)

  df['check_company'] = df['company_name'].apply(lambda x: 'T'
                                                 if x in companies else 'F')
  df = df[df['check_company'] == 'T']
  df.drop('check_company', axis=1, inplace=True)
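As a side note, the groupby-max-then-merge pattern above selects, for each `profile_id`, the row with the latest `date`. On a tiny made-up sample (column values here are illustrative, not from the real table), the same result can be obtained in a single pass with `sort_values` plus `drop_duplicates`, which avoids building the intermediate merge frame:

```python
import pandas as pd

# Tiny stand-in for df_snapshot_profiles; values are hypothetical.
snap = pd.DataFrame({
    'profile_id': [1, 1, 2],
    'date': ['2020-01-01', '2020-02-01', '2020-01-15'],
    'is_manager': [False, True, False],
})

# Sort so the latest date comes first, then keep the first row per profile:
# equivalent to groupby('profile_id').max() on date followed by a merge back.
latest = (snap.sort_values('date', ascending=False)
              .drop_duplicates('profile_id')
              .sort_values('profile_id')
              .reset_index(drop=True))
```

`latest` then holds one row per `profile_id`, carrying the full row from the most recent snapshot.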

And here is the script to load the tables and print the memory usage:

import os

import pandas as pd
import psutil


def usage():
  # resident memory of the current process, in MiB
  process = psutil.Process(os.getpid())
  return process.memory_info()[0] / float(2**20)


def load_table(table, conn):
    print_x(f'{usage()} Mb')
    print_x(f'loading table {table}')
    cursor = conn.cursor()
    # NOTE: identifiers cannot be bound as query parameters, so the schema
    # and table names are interpolated here; `table` must be trusted input.
    cursor.execute(f'''select * from {ORIGIN_SCHEMA}.{table};''')
    rows = cursor.fetchall()
    cursor.execute(f'''
        select column_name from information_schema.columns where table_name = '{table}';
    ''')
    labels = [label[0] for label in cursor.fetchall()]
    return pd.DataFrame.from_records(rows, columns=labels)

Is there a way to avoid the error, by reducing the memory usage or some other way?

Well, it should be an out-of-memory issue. You can either expand your memory or move part of the work out of core (load and process in batches).

  • If you have the budget, expand memory. As a rough estimate: 1 million rows with a worst-case string length of 1000 per column is 1M * 1K = 1 GB just for the data load. Merging or transforming DataFrames needs extra memory on top of that, so 16 GB should be OK.

  • If you are comfortable with it, try out-of-core mode, meaning working from the hard disk:

    • dask is one of the pandas-compatible out-of-core modules. It computes in batch mode: slow, but it still works.
    • Use the database itself for some of the feature work. Most databases can do work similar to pandas, though the SQL code needed can get complicated.
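The batch-mode idea can also be sketched with plain pandas, without dask, via the `chunksize` parameter of `read_sql_query`. The snippet below uses a throwaway in-memory SQLite table as a stand-in for the real connection and table (all names here are illustrative):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the real database connection.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE snapshot_profiles (profile_id INTEGER, payload TEXT);
    INSERT INTO snapshot_profiles VALUES (1, 'a'), (2, 'b'), (3, 'c');
''')

# chunksize turns read_sql_query into an iterator of small DataFrames,
# so only one chunk needs to be held in memory at a time.
chunks = pd.read_sql_query('SELECT * FROM snapshot_profiles', conn,
                           chunksize=2)
parts = [chunk for chunk in chunks]   # process/reduce each chunk here instead
df = pd.concat(parts, ignore_index=True)
```

In the real pipeline you would aggregate or filter each chunk before concatenating, so the full table never sits in memory at once.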

Good luck. If you like my answer, please vote it up.

I solved this issue by implementing a server side cursor, and getting info in chunks, like so:

  # A named cursor in psycopg2 is a server-side cursor, so rows are
  # fetched from the database in batches instead of all at once.
  serverCursor = conn.cursor("serverCursor")
  serverCursor.execute(f'''select * from {ORIGIN_SCHEMA}.{table};''')

  df = []
  while True:
    records = serverCursor.fetchmany(size=50000)
    if not records:
      break
    df = df + records
  serverCursor.close()
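A further refinement of the same fetchmany loop is to convert each batch to a DataFrame immediately instead of accumulating raw tuples, so peak memory is roughly one batch plus the growing result. A runnable sketch against an in-memory SQLite table (a stand-in for the Postgres server-side cursor; table and column names are made up):

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for the Postgres connection in the answer.
conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE t (profile_id INTEGER, name TEXT);
    INSERT INTO t VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e');
''')

cursor = conn.cursor()
cursor.execute('SELECT * FROM t')

# Build a small DataFrame per batch, then concatenate once at the end.
frames = []
while True:
    records = cursor.fetchmany(size=2)
    if not records:
        break
    frames.append(pd.DataFrame(records, columns=['profile_id', 'name']))
cursor.close()

df = pd.concat(frames, ignore_index=True)
```

With psycopg2 the only difference is creating the cursor with a name, as in the answer above, so the batching happens on the server side.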
