
Error: module not found while running Apache Beam job in Google Cloud

I am trying to run an Apache Beam job in Google Cloud, but it never completes successfully. I have tried debugging and other troubleshooting steps, but it still gets stuck every time. Here's the error:

  File "/home/avien/.pyenv/versions/dataflow/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1730, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "xmlload.py", line 59, in <lambda>
NameError: name 'parse_into_dict' is not defined [while running 'parse-ptransform-73']

When running without the lambda function and passing the function directly to beam.Map(), the error changed to:

File "/home/avien/.pyenv/versions/dataflow/lib/python3.8/site-packages/apache_beam/transforms/core.py", line 1730, in <lambda>
    wrapper = lambda x: [fn(x)]
  File "xmlload.py", line 36, in parse_into_dict
ModuleNotFoundError: No module named 'xmltodict' [while running 'parse-ptransform-73']

I have already set up pyenv and installed xmltodict:

Requirement already satisfied: xmltodict in ./.pyenv/versions/3.8.13/envs/dataflow/lib/python3.8/site-packages (0.13.0)

Here is the pipeline I am trying to run:

import argparse
import logging
import apache_beam as beam
import xmltodict

def parse_into_dict(xmlfile):
    import xmltodict
    import apache_beam as beam
    with open(xmlfile) as ifp:
        doc = xmltodict.parse(ifp.read())
        return doc

table_schema = {
    'fields': [
        {'name' : 'CustomerID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'EmployeeID', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'OrderDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'RequiredDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name' : 'ShipInfo', 'type': 'RECORD', 'mode': 'NULLABLE', 'fields': [
            {'name' : 'ShipVia', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'Freight', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipName', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipAddress', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipCity', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipRegion', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipPostalCode', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShipCountry', 'type': 'STRING', 'mode': 'NULLABLE'},
            {'name' : 'ShippedDate', 'type': 'STRING', 'mode': 'NULLABLE'},
        ]},
    ]
}

def cleanup(x):
    import copy
    y = copy.deepcopy(x)
    if '@ShippedDate' in x['ShipInfo']: # optional attribute
        y['ShipInfo']['ShippedDate'] = x['ShipInfo']['@ShippedDate']
        del y['ShipInfo']['@ShippedDate']
    print(y)
    return y

def get_orders(doc):
    for order in doc['Root']['Orders']['Order']:
        yield cleanup(order)

def run(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument(
      '--output',
      required=True,
      help=(
          'Specify text file orders.txt or BigQuery table project:dataset.table '))

    known_args, pipeline_args = parser.parse_known_args(argv)
    with beam.Pipeline(argv=pipeline_args) as p:
        orders = (p
             | 'files' >> beam.Create(['orders.xml'])
             | 'parse' >> beam.Map(parse_into_dict)
             | 'orders' >> beam.FlatMap(get_orders))

        if '.txt' in known_args.output:
             orders | 'totxt' >> beam.io.WriteToText(known_args.output)
        else:
             orders | 'tobq' >> beam.io.WriteToBigQuery(known_args.output,
                                       schema=table_schema,
                                       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND, #WRITE_TRUNCATE
                                       create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

I have tried the following steps so far:

  1. Tried to include all the functions inside the pipeline file itself, but the result is the same.
  2. Included all the imports in every function.

Also, when running parse_into_dict in a standalone Python file it doesn't throw any error at all; I am able to successfully convert the XML to a dict.

Any help is highly appreciated. Thanks in advance!

Try importing modules inside your function and pipeline definitions, or use --save_main_session. NameErrors are common because the worker doesn't know about objects defined in the global namespace.
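For reference, a minimal sketch of enabling that flag programmatically inside run() (you can equally just pass --save_main_session on the command line); this assumes the rest of the pipeline stays as in the question:

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
import apache_beam as beam

def run(argv=None):
    # Pickle the main session so that top-level imports (e.g. xmltodict)
    # and module-level globals such as table_schema are shipped to the workers.
    options = PipelineOptions(argv)
    options.view_as(SetupOptions).save_main_session = True
    with beam.Pipeline(options=options) as p:
        _ = p | 'files' >> beam.Create(['orders.xml'])  # rest of the pipeline as above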

In addition to @ningk's answer, you have to give Dataflow your orders.xml file. You are trying to load this file in the first step of your pipeline (beam.Create(['orders.xml'])); however, Dataflow does not know about or have this file when it executes your pipeline.

Try adding a MANIFEST.in file (mind the caps) with the following content

include path/to/xml/orders.xml

in the source folder of your pipeline code. See here for an example file.
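Note that MANIFEST.in on its own is only read when a source package is built, so it is usually paired with a small setup.py that is handed to the runner via --setup_file. A sketch follows; the package name and version are placeholders, not anything from the original post:

# setup.py -- staged to Dataflow with e.g.:
#   python xmlload.py --output ... --setup_file ./setup.py
import setuptools

setuptools.setup(
    name='xmlload-pipeline',           # hypothetical package name
    version='0.0.1',
    install_requires=['xmltodict'],    # worker-side dependency from the traceback
    packages=setuptools.find_packages(),
    include_package_data=True,         # honor MANIFEST.in entries such as orders.xml
)

Whether the worker can then open the file by a relative path depends on where the package is installed, so reading the XML from Cloud Storage instead is often the simpler choice.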

I had a similar problem with dependencies, only with the Dataflow runner, and it helped to include --requirements_file requirements.txt when running the script, so you end up with something like:

python pyscript.py --requirements_file requirements.txt

Check the Beam documentation: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
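In this case requirements.txt only needs to list whatever is missing on the workers; assuming the version shown in the local environment above, it could be as small as:

# requirements.txt -- staged to the workers via --requirements_file
xmltodict==0.13.0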
