TFX component CsvExampleGen always produces Examples with empty outputs (and inputs)

Question

I can run CsvExampleGen without an error message, but the outputs (and inputs) of the resulting Examples are always empty.

I am using tfx==0.24.0.

According to the documentation and tutorials (including https://www.tensorflow.org/tfx/guide/examplegen) and the release notes for tfx 0.23.0/0.24.0 (https://github.com/tensorflow/tfx/releases), the following lines of code should suffice to read a CSV file with CsvExampleGen:

from tfx.components import CsvExampleGen
example_gen = CsvExampleGen(input_base=data_path)

where "data_path" identifies a directory with CSV files. (Note that the code differs from the official documentation in that it does not use "external_input"; instead it follows the new interface documented in the release notes for 0.23.0.)

From the tutorials I gather that a single, simple CSV file should suffice for testing (though I tried with up to 7 files).

I do not get any error message (except for one which I am told to ignore if I don't have a GPU available); however, the outputs (and inputs) of the resulting structure are empty (empty list and empty set / dict, respectively). I think they should not be empty.

The CSV files in question ARE found and touched, because if I introduce an error there (like an additional column in one row), I do get an error message.

I tried this with a stand-alone function as well as inside a pipeline (run with BeamDagRunner, for simplicity). The pipeline does generate a metadata.db, but I cannot find any trace of the CSV data there (like column names). Adding a StatisticsGen to the pipeline didn't help any further.
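
One way to see what the metadata store actually holds is to open metadata.db with Python's built-in sqlite3 module. The following is only a sketch and not part of the original post: it assumes the ML Metadata SQLite schema has a table named Artifact with id and uri columns (this may differ between versions), and list_tables and artifact_uris are hypothetical helper names. The store is meant to record metadata about artifacts, such as where they live on disk, rather than the CSV contents, so column names would probably not be expected to show up there.

```python
import sqlite3
from typing import List, Tuple


def list_tables(db_path: str) -> List[str]:
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def artifact_uris(db_path: str) -> List[Tuple[int, str]]:
    """Return (id, uri) pairs from the Artifact table.

    Assumption: the table and column names follow the ML Metadata
    SQLite schema; adjust them if list_tables() shows something else.
    """
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT id, uri FROM Artifact ORDER BY id").fetchall()
    finally:
        conn.close()
```

Calling artifact_uris(METADATA_PATH) after a run should then list the directories in which the components wrote their outputs, if the assumed schema holds.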

I tried this with the iris dataset, with and without column headers. I also tried with up to 7 small, artificial CSV files within data_path, alternatively with purely numerical and mixed numerical/categorical data, and alternatively with commas and semicolons as separators. The result is always the same.
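
Before involving TFX at all, the input files themselves can be checked with the standard library. The sketch below is an illustration, not something from the original post: column_counts and check_directory are hypothetical helper names, and it assumes the files end in .csv. It reports how many fields each row has when split on a given delimiter, which makes an extra column in one row, or a semicolon-separated file being read as a single column, easy to spot.

```python
import csv
import glob
import os
from typing import Dict, List


def column_counts(csv_path: str, delimiter: str = ",") -> List[int]:
    """Return the sorted distinct numbers of fields per non-empty row."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        return sorted({len(row) for row in reader if row})


def check_directory(data_path: str, delimiter: str = ",") -> Dict[str, List[int]]:
    """Map each *.csv file name in data_path to its distinct field counts."""
    report = {}
    for path in sorted(glob.glob(os.path.join(data_path, "*.csv"))):
        report[os.path.basename(path)] = column_counts(path, delimiter)
    return report
```

A file whose list contains exactly one number has a consistent column count; more than one number points at ragged rows.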

Do I have a problem with the code, or maybe with some configuration or libraries?

Here is the full code (as far as it may be relevant):

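import os
from typing import List, Optional, Text

from ml_metadata.proto import metadata_store_pb2
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

PIPELINE_NAME = "X-pipeline-iris2"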
BASE_PATH = r"C:\***\FX_Experiments"
BASE_PATH_PIPELINE = os.path.join(BASE_PATH, "pipeline")
BASE_PATH_TESTS = os.path.join(BASE_PATH, "tests")
PIPELINE_ROOT = os.path.join(BASE_PATH_PIPELINE, "output")
METADATA_PATH = os.path.join(BASE_PATH_PIPELINE, "tfx_metadata", PIPELINE_NAME, "metadata.db")
DATA_PATH = os.path.join(BASE_PATH_TESTS, "iris2")
ENABLE_CACHE = True


def create_pipeline(
        pipeline_name: Text, pipeline_root: Text, data_path: Text,
        enable_cache: bool,
        metadata_connection_config: Optional[metadata_store_pb2.ConnectionConfig] = None,
        beam_pipeline_args: Optional[List[Text]] = None
):
    components = []

    example_gen = CsvExampleGen(input_base=data_path)
    components.append(example_gen)

    stat_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    components.append(stat_gen)

    return pipeline.Pipeline(
        pipeline_name = pipeline_name,
        pipeline_root = pipeline_root,
        components = components,
        enable_cache = enable_cache,
        metadata_connection_config = metadata_connection_config,
        beam_pipeline_args = beam_pipeline_args
    )

def run_pipeline():
    this_pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(METADATA_PATH)
    )
    BeamDagRunner().run(this_pipeline)

Also potentially useful, the logger info:

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen depends on [].
INFO:absl:Component CsvExampleGen is scheduled.
INFO:absl:Component StatisticsGen depends on ['Run[CsvExampleGen]'].
INFO:absl:Component StatisticsGen is scheduled.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
...
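
The log shows CsvExampleGen finishing, so whatever it produced should exist as files below the pipeline root. A minimal sketch for listing them follows; list_output_files is a hypothetical helper, and the expectation that the examples end up in per-component, per-split subdirectories of PIPELINE_ROOT is an assumption about the TFX output layout, not something stated in the post.

```python
import os
from typing import List, Tuple


def list_output_files(root: str) -> List[Tuple[str, int]]:
    """Return (path relative to root, size in bytes) for every file below root."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.append((os.path.relpath(full, root), os.path.getsize(full)))
    return sorted(found)
```

Running list_output_files(PIPELINE_ROOT) after the pipeline has finished and seeing non-empty files for CsvExampleGen would suggest that the data was read, even though nothing is attached to the Python objects.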

Answer 1

Felix, if you follow the guides, you are probably running your code in a notebook. If you want to see the results directly, you have to enable TFX interactive mode using InteractiveContext.

https://www.tensorflow.org/tfx/api_docs/python/tfx/orchestration/experimental/interactive/interactive_context/InteractiveContext

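from tfx.components import CsvExampleGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()
example_gen = CsvExampleGen(input_base='/content/data')
context.run(example_gen)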
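
Whichever runner is used, one way to confirm that examples were really written is to count the records in the output files. The sketch below is an assumption-laden illustration: it presumes the output is a gzip-compressed TFRecord file, walks only the TFRecord framing (8-byte little-endian length, 4-byte CRC of the length, payload, 4-byte CRC of the payload), and does not verify the checksums or decode the examples. count_tfrecords is a hypothetical helper name.

```python
import gzip
import struct


def count_tfrecords(path: str) -> int:
    """Count records in a gzip-compressed TFRecord file without checking CRCs."""
    count = 0
    with gzip.open(path, "rb") as handle:
        while True:
            header = handle.read(12)  # uint64 length + uint32 CRC of the length
            if not header:
                break  # clean end of file
            if len(header) < 12:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header[:8])
            payload = handle.read(length + 4)  # payload + uint32 CRC of the payload
            if len(payload) < length + 4:
                raise ValueError("truncated record payload")
            count += 1
    return count
```

A count greater than zero for a file in the train or eval split directory would indicate that the Examples artifact is not actually empty.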

Solution 1 (2021-01-26 10:02:33)
