TFX component CsvExampleGen always yields Examples with empty outputs (and inputs)

I can run CsvExampleGen without an error message, but the outputs (and inputs) of the resulting Examples are always empty.

I am using tfx==0.24.0.

According to the documentation and tutorials (including https://www.tensorflow.org/tfx/guide/examplegen ) and the release notes for tfx 0.23.0/0.24.0 ( https://github.com/tensorflow/tfx/releases ), the following lines of code should suffice to read a CSV file with CsvExampleGen:

from tfx.components import CsvExampleGen
example_gen = CsvExampleGen(input_base=data_path)

where "data_path" identifies a directory with CSV files. (Note that the code differs from the official documentation in that it does not use "external_input"; instead it follows the new interface documented in the release notes for 0.23.0.)

From the tutorials I gather that a single, simple CSV file should suffice for testing (though I tried with up to 7 files).

I do not get any error message (except for one that I am told to ignore if I don't have a GPU available); however, the outputs and inputs of the resulting structure are empty (an empty list and an empty set/dict, respectively). I think they should not be empty.

The CSV files in question ARE found and read, because if I introduce an error there (such as an additional column in one row), I do get an error message.

I tried this with a stand-alone function as well as inside a pipeline (run with BeamDagRunner, for simplicity). The pipeline does generate a metadata.db, but I cannot find any trace of the CSV data there (such as column names). Adding a StatisticsGen to the pipeline did not help either.
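
For checking what the metadata.db actually holds, a minimal sketch using only Python's built-in sqlite3 module is shown here. It lists every table in a SQLite file together with its row count and makes no assumption about the ML Metadata schema. The function name count_rows_per_table is hypothetical and not part of TFX. As far as I understand, the store records artifacts and executions (for example where outputs were written) rather than the CSV contents, so column names would not be expected there.

```python
import sqlite3
from typing import Dict


def count_rows_per_table(db_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every table in a SQLite file."""
    connection = sqlite3.connect(db_path)
    try:
        cursor = connection.cursor()
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        table_names = [row[0] for row in cursor.fetchall()]
        counts = {}
        for name in table_names:
            # Names come from the catalogue itself, so quoting them is enough.
            cursor.execute('SELECT COUNT(*) FROM "{}"'.format(name))
            counts[name] = cursor.fetchone()[0]
        return counts
    finally:
        connection.close()
```

Calling count_rows_per_table(METADATA_PATH) after a pipeline run would show whether any artifact or execution rows were written at all.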

I tried this with the iris dataset, with and without column headers. I also tried with up to 7 small, artificial CSV files within data_path, with purely numerical as well as mixed numerical/categorical data, and with commas as well as semicolons as separators. The result is always the same.
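
To rule out the input files themselves, a small pre-check can be run over the directory. This is a sketch under assumptions, not part of TFX: to my understanding CsvExampleGen expects comma-separated files whose first row is a header, so the hypothetical helper check_csv_directory below flags files that parse into a single column (often a sign of a different delimiter) and rows whose field count differs from the header.

```python
import csv
import glob
import os
from typing import Dict, List


def check_csv_directory(data_path: str, delimiter: str = ",") -> Dict[str, List[str]]:
    """Return {file_name: [problems]} for every *.csv file in data_path."""
    report = {}
    for file_path in sorted(glob.glob(os.path.join(data_path, "*.csv"))):
        problems = []
        with open(file_path, newline="") as handle:
            rows = list(csv.reader(handle, delimiter=delimiter))
        if not rows:
            problems.append("file is empty")
        else:
            header = rows[0]
            if len(header) < 2:
                problems.append("only one column found; wrong delimiter?")
            # Row numbers are 1-based and count the header as row 1.
            for row_number, row in enumerate(rows[1:], start=2):
                if len(row) != len(header):
                    problems.append(
                        "row {}: {} fields, header has {}".format(
                            row_number, len(row), len(header)))
        report[os.path.basename(file_path)] = problems
    return report
```

An empty problem list for every file suggests the inputs are well formed, in which case the empty outputs are more likely a matter of how the component is run and inspected.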

Do I have a problem with the code, or maybe with some configuration or libraries?

Here is the full code (as far as it seems relevant):

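# Imports assumed for tfx==0.24.0 (not shown in the original post).
import os
from typing import List, Optional, Text

from ml_metadata.proto import metadata_store_pb2
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

PIPELINE_NAME = "X-pipeline-iris2"
BASE_PATH = r"C:\***\FX_Experiments"
BASE_PATH_PIPELINE = os.path.join(BASE_PATH, "pipeline")
BASE_PATH_TESTS = os.path.join(BASE_PATH, "tests")
PIPELINE_ROOT = os.path.join(BASE_PATH_PIPELINE, "output")
METADATA_PATH = os.path.join(BASE_PATH_PIPELINE, "tfx_metadata", PIPELINE_NAME, "metadata.db")
DATA_PATH = os.path.join(BASE_PATH_TESTS, "iris2")
ENABLE_CACHE = True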


def create_pipeline(
        pipeline_name: Text, pipeline_root: Text, data_path: Text,
        enable_cache: bool,
        metadata_connection_config: Optional[metadata_store_pb2.ConnectionConfig] = None,
        beam_pipeline_args: Optional[List[Text]] = None
):
    components = []

    example_gen = CsvExampleGen(input_base=data_path)
    components.append(example_gen)

    stat_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    components.append(stat_gen)

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=enable_cache,
        metadata_connection_config=metadata_connection_config,
        beam_pipeline_args=beam_pipeline_args
    )

def run_pipeline():
    this_pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(METADATA_PATH)
    )
    BeamDagRunner().run(this_pipeline)

Also potentially useful, the logger output:

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen depends on [].
INFO:absl:Component CsvExampleGen is scheduled.
INFO:absl:Component StatisticsGen depends on ['Run[CsvExampleGen]'].
INFO:absl:Component StatisticsGen is scheduled.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
...

Felix, if you are following the guides, you are probably running your code in a notebook. If you want to see the results directly, you have to enable TFX interactive mode using InteractiveContext.

https://www.tensorflow.org/tfx/api_docs/python/tfx/orchestration/experimental/interactive/interactive_context/InteractiveContext

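from tfx.components import CsvExampleGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()
example_gen = CsvExampleGen(input_base='/content/data')
context.run(example_gen)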
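
Once the component has run, the generated examples live as files on disk rather than inside the Python object. The sketch below uses only the standard library and lists the files in each sub-directory of an artifact directory. The helper name list_split_files is hypothetical. It assumes the artifact directory contains one sub-directory per split (such as train and eval).

```python
import os
from typing import Dict, List


def list_split_files(artifact_uri: str) -> Dict[str, List[str]]:
    """Map each sub-directory of artifact_uri to the sorted file names in it."""
    splits = {}
    for entry in sorted(os.listdir(artifact_uri)):
        split_dir = os.path.join(artifact_uri, entry)
        # Only directories are treated as splits; loose files are ignored.
        if os.path.isdir(split_dir):
            splits[entry] = sorted(os.listdir(split_dir))
    return splits
```

After context.run(example_gen), the directory to pass in should be available as example_gen.outputs['examples'].get()[0].uri. Before the component has been executed, get() most likely returns an empty list, which would match the empty outputs described in the question.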
