
great_expectations create datasource of csv files on ADLS Gen2

I want to run great_expectations test suites against csv files in my ADLS Gen2. On my ADLS, I have a container called "data" which contains a file at mypath/test/mydata.csv . I use an InferredAssetAzureDataConnector . I was able to create and test/validate the datasource configuration, but I believe there is a "silent" issue which was not caught.

The problem is that I cannot create a test suite based on this datasource. When I run great_expectations suite new ,

  • I select (3) to create the suite with the profiler, then
  • select my newly created datasource, and then
  • instead of showing me the available files at the data source, it crashes with the following error (see below for the full stack trace):
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'

When I execute this with a local datasource ( InferredAssetFilesystemDataConnector ), it shows me the available files at the data source for selection in the CLI.

Does the error mean it cannot find the csv file on the ADLS and thus has no names to show? How do I fix this?

My code to create the datasource:

import great_expectations as ge
# CLI helpers imported by the generated "datasource new" notebook; used when saving the datasource
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
context = ge.get_context()
datasource_name = "my_datasource_name"


example_yaml = f"""
name: {datasource_name}
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
  azure_options:
      account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
      credential: <ACCOUNT-KEY>
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetAzureDataConnector
    azure_options:
        account_url: https://<ACCOUNT-NAME>.blob.core.windows.net
        credential: <ACCOUNT-KEY>
    container: data
    name_starts_with: mypath/test
    default_regex:
      group_names:
        - data_asset_name
      pattern: (.csv)
  default_runtime_data_connector_name:
    class_name: RuntimeDataConnector
    assets:
      my_runtime_asset_name:
        batch_identifiers:
          - runtime_batch_identifier_name
"""
print(example_yaml)
# Test the yml:
context.test_yaml_config(yaml_config=example_yaml)

The output after creating the datasource via the Jupyter notebook:

Attempting to instantiate class from config...
    Instantiating as a Datasource, since class_name is Datasource
    Successfully instantiated Datasource


ExecutionEngine class name: SparkDFExecutionEngine
Data Connectors:
    default_inferred_data_connector_name : InferredAssetAzureDataConnector
    Available data_asset_names (0 of 0):
    Unmatched data_references (0 of 0):[]
    default_runtime_data_connector_name : RuntimeDataConnector
    Available data_asset_names (1 of 1):
        my_runtime_asset_name (0 of 0): []
    Unmatched data_references (0 of 0):[]
<great_expectations.datasource.new_datasource.Datasource at 0x1cdc9e01f70>

Full error stack:

Traceback (most recent call last):
  File "c:\coding\python38\lib\runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "c:\coding\python38\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\coding\myrepo\venv\Scripts\great_expectations.exe\__main__.py", line 7, in <module>
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\cli.py", line 190, in main
    cli()
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\coding\myrepo\venv\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\click\decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 151, in suite_new
    _suite_new_workflow(
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 335, in _suite_new_workflow
    raise e
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\suite.py", line 279, in _suite_new_workflow
    toolkit.add_citation_with_batch_request(
  File "C:\coding\myrepo\venv\lib\site-packages\great_expectations\cli\toolkit.py", line 1020, in add_citation_with_batch_request
    and BatchRequest(**batch_request)
TypeError: __init__() missing 1 required positional argument: 'data_asset_name'

I had a mistake in my regex; with the following pattern it works flawlessly:

    default_regex:
      group_names:
        - data_asset_name
      pattern: (.*\.csv)
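This also explains the "silent" issue in the test_yaml_config output ( Available data_asset_names (0 of 0) ): the data connector matches each blob path against default_regex , and a path only becomes a data asset when the pattern matches it. A minimal sketch with plain re (approximating the connector's matching, which is anchored at the start of the path like re.match ) shows why (.csv) finds nothing while (.*\.csv) works:

```python
import re

blob_path = "mypath/test/mydata.csv"

# Original pattern: "." matches exactly one character, so "(.csv)" can only
# match a four-character string like "xcsv" at the start of the path.
print(re.match(r"(.csv)", blob_path))      # None -> no data assets found

# Corrected pattern: ".*\.csv" matches the whole path and captures it,
# so the capture group can be used as data_asset_name.
match = re.match(r"(.*\.csv)", blob_path)
print(match.group(1))                      # mypath/test/mydata.csv
```

Because the failed match produced zero asset names, the CLI later tried to build a BatchRequest without a data_asset_name , which is what surfaced as the TypeError .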
