
great_expectations create datasource of csv files on ADLS Gen2

I want to run great_expectations test suites against csv files in my ADLS Gen2. On my ADLS, I have a container called "data" which contains a file at mypath/test/mydata.csv . I use an InferredAssetAzureDataConnector . I was able to create and test/validate the datasource configuration, but I believe there is a "silent" issue which was not caught.

The problem is that I cannot create a test suite based on this datasource. When I run great_expectations suite new ,

  • I select (3) to create the suite with the profiler, then
  • select my newly created datasource, and then
  • instead of showing me the available files at the data source, it crashes with the following error (see below for the full stack trace):
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'

When I execute this with a local datasource ( InferredAssetFilesystemDataConnector ), it shows me the available files at the data source for selection in the CLI.

Does the error mean it cannot find the csv file on the ADLS and thus has no names to show? How do I fix this?

My code to create the datasource:

import great_expectations as ge
# CLI helpers imported by the generated "datasource new" notebook; used when saving the datasource
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
context = ge.get_context()
datasource_name = "my_datasource_name"


example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
  azure_options:
      account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
      credential: <ACCOUNT-KEY>
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetAzureDataConnector
    azure_options:
        account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
        credential: <ACCOUNT-KEY>
    container: data
    name_starts_with: mypath/test
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.csv)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    assets:
      my_runtime_asset_name:
        batch_identifiers:
          - runtime_batch_identifier_name
"""
print(example_yaml)
# Test the yml:
context.test_yaml_config(yaml_config=example_yaml)

The output after creating the datasource via the Jupyter notebook:

Attempting to instantiate class from config...
    Instantiating as a Datasource, since class_name is Datasource
    Successfully instantiated Datasource


ExecutionEngine class name: SparkDFExecutionEngine
Data Connectors:
    default_inferred_data_connector_name : InferredAssetAzureDataConnector
    Available data_asset_names (0 of 0):
    Unmatched data_references (0 of 0):[]
    default_runtime_data_connector_name : RuntimeDataConnector
    Available data_asset_names (1 of 1):
        my_runtime_asset_name (0 of 0): []
    Unmatched data_references (0 of 0):[]
<great_expectations.datasource.new_datasource.Datasource at 0x1cdc9e01f70>

Full error stack:

Traceback (most recent call last):
  File "c:\coding\python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\coding\python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\coding\myrepo\venv\Scripts\great_expectations.exe\__main__.py", line 7, in <module>
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\cli.py", line 190, in main
    cli()
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 151, in suite_new
    _suite_new_workflow(
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 335, in _suite_new_workflow
    raise e
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 279, in _suite_new_workflow
    toolkit.add_citation_with_batch_request(
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\toolkit.py", line 1020, in add_citation_with_batch_request
    and BatchRequest(**batch_request)
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'

I had a mistake in my regex; with the following pattern it works flawlessly:

    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*\.csv)
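This also explains the "silent" issue in the test_yaml_config output ( Available data_asset_names (0 of 0) ): the data connector matches each blob path against default_regex , and a path only becomes a data asset when the pattern matches it. A minimal sketch with plain re (approximating the connector's matching, which is anchored at the start of the path like re.match ) shows why (.csv) finds nothing while (.*\.csv) works:

```python
import re

blob_path = "mypath/test/mydata.csv"

# Original pattern: "." matches exactly one character, so "(.csv)" can only
# match a four-character string like "xcsv" at the start of the path.
print(re.match(r"(.csv)", blob_path))      # None -> no data assets found

# Corrected pattern: ".*\.csv" matches the whole path and captures it,
# so the capture group can be used as data_asset_name.
match = re.match(r"(.*\.csv)", blob_path)
print(match.group(1))                      # mypath/test/mydata.csv
```

Because the failed match produced zero asset names, the CLI later tried to build a BatchRequest without a data_asset_name , which is what surfaced as the TypeError .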
