
Error: module not found while running Apache Beam job in Google Cloud

I am trying to run an Apache Beam job in Google Cloud, but it never completes successfully. I have tried debugging and other troubleshooting steps, but it still gets stuck every time. Here's the error:

  File "/home/avien/.pyenv/versions/dataflow/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1730, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "xmlload.py", line 59, in <lambda>
NameError: name 'parse_into_dict' is not defined [while running 'parse-ptransform-73']

When running without the lambda function and passing the function directly to beam.Map(), the error changed to:

File "/home/avien/.pyenv/versions/dataflow/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1730, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "xmlload.py", line 36, in parse_into_dict
ModuleNotFoundError: No module named 'xmltodict' [while running 'parse-ptransform-73']

I have already set up pyenv and installed xmltodict:

Requirement already satisfied: xmltodict in ./.pyenv/versions/3.8.13/envs/dataflow/lib/python3.8/site-packages (0.13.0)

Here is the pipeline I am trying to run:

import argparse
import logging
import apache_beam as beam
import xmltodict

def parse_into_dict(xmlfile):
    import xmltodict
    import apache_beam as beam
    with open(xmlfile) as ifp:
        doc = xmltodict.parse(ifp.read())
        return doc

table_schema = {
    'fields': [
        {'name' : 'CustomerID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'EmployeeID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'OrderDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'RequiredDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'ShipInfo', 'type': 'RECORD', 'mode': 'NULLABLE', 'fields': [
            {'name' : 'ShipVia', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'Freight', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipName', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipAddress', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipCity', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipRegion', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipPostalCode', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipCountry', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShippedDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        ]},
    ]
}

def cleanup(x):
    import copy
    y = copy.deepcopy(x)
    if '@ShippedDate' in x['ShipInfo']: # optional attribute
        y['ShipInfo']['ShippedDate'] = x['ShipInfo']['@ShippedDate']
        del y['ShipInfo']['@ShippedDate']
    print(y)
    return y

def get_orders(doc):
    for order in doc['Root']['Orders']['Order']:
        yield cleanup(order)

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
      '--output',
      required=True,
      help=(
          'Specify text file orders.txt or BigQuery table project:dataset.table '))

    known_args, pipeline_args = parser.parse_known_args(argv)
    with beam.Pipeline(argv=pipeline_args) as p:
        orders = (p
             | 'files' >> beam.Create(['orders.xml'])
             | 'parse' >> beam.Map(parse_into_dict)
             | 'orders' >> beam.FlatMap(get_orders))

        if '.txt' in known_args.output:
             orders | 'totxt' >> beam.io.WriteToText(known_args.output)
        else:
             orders | 'tobq' >> beam.io.WriteToBigQuery(known_args.output,
                                       schema=table_schema,
                                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND, #WRITE_TRUNCATE
                                       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

I have tried the following steps so far:

  1. Tried to include all the functions inside the pipeline file itself, but the result is the same.
  2. Included all the imports in every function.

Also, when running parse_into_dict in a standalone Python file it doesn't throw any error at all; I am able to successfully convert the XML to a dict.

Any help is highly appreciated. Thanks in advance!

Try importing modules inside your function and pipeline definitions, or use --save_main_session. NameErrors are common because the worker doesn't know about objects defined in the global namespace.
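For reference, a minimal sketch of enabling that flag programmatically inside run() (you can equally just pass --save_main_session on the command line); this assumes the rest of the pipeline stays as in the question:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
import apache_beam as beam

def run(argv=None):
    # Pickle the main session so that top-level imports (e.g. xmltodict)
    # and module-level globals such as table_schema are shipped to the workers.
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        _ = p | 'files' >> beam.Create(['orders.xml'])  # rest of the pipeline as above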

In addition to @ningk's answer, you have to give Dataflow your orders.xml file. You are trying to load this file in the first step of your pipeline (beam.Create(['orders.xml'])); however, Dataflow does not know about or have this file when it executes your pipeline.

Try adding a MANIFEST.in file (mind the caps) with the following content

include path/to/xml/orders.xml

in the source folder of your pipeline code. See here for an example file.
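Note that MANIFEST.in on its own is only read when a source package is built, so it is usually paired with a small setup.py that is handed to the runner via --setup_file. A sketch follows; the package name and version are placeholders, not anything from the original post:

# setup.py -- staged to Dataflow with e.g.:
#   python xmlload.py --output ... --setup_file ./setup.py
import setuptools

setuptools.setup(
    name='xmlload-pipeline',           # hypothetical package name
    version='0.0.1',
    install_requires=['xmltodict'],    # worker-side dependency from the traceback
    packages=setuptools.find_packages(),
    include_package_data=True,         # honor MANIFEST.in entries such as orders.xml
)

Whether the worker can then open the file by a relative path depends on where the package is installed, so reading the XML from Cloud Storage instead is often the simpler choice.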

I had a similar problem with dependencies, only with the Dataflow runner, and it helped to include --requirements_file requirements.txt when running the script, so you end up with something like:

python pyscript.py --requirements_file requirements.txt

Check the Beam documentation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
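In this case requirements.txt only needs to list whatever is missing on the workers; assuming the version shown in the local environment above, it could be as small as:

# requirements.txt -- staged to the workers via --requirements_file
xmltodict==0.13.0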
