
TFX component CsvExampleGen always yields Examples with empty outputs (and inputs)

I can run CsvExampleGen without an error message, but the outputs (and inputs) of the resulting Examples are always empty.

I am using tfx==0.24.0.

To use CsvExampleGen for reading CSV files, according to the documentation and tutorials (incl. https://www.tensorflow.org/tfx/guide/examplegen ) and the release notes for tfx 0.23.0/0.24.0 ( https://github.com/tensorflow/tfx/releases ), the following lines of code should suffice to read a CSV file:

from tfx.components import CsvExampleGen
example_gen = CsvExampleGen(input_base=data_path)

where "data_path" identifies a directory with CSV files. (Note that the code differs from the official documentation in that it does not use "external_input"; instead it follows the new interface documented in the release notes for 0.23.0.)

From the tutorials I gather that a single, simple CSV file should suffice for testing (though I tried with up to 7 files).

I do not get any error message (except for one which I am told to ignore if I don't have a GPU available); however, the outputs (and inputs) of the resulting structure are empty (empty list and empty set / dict, respectively). I think they should not be empty, however.

The CSV files in question ARE found and touched, because if I introduce an error there (like an additional column in one row), I do get an error message.
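The kind of malformed row mentioned above (one row with an extra column) can also be detected before the pipeline runs. The following is a minimal sketch using only the Python standard library; the helper name `find_ragged_rows` is hypothetical and is not part of TFX, and it only compares column counts against the first non-blank row.

```python
import csv
from typing import List, Tuple


def find_ragged_rows(csv_path: str, delimiter: str = ",") -> List[Tuple[int, int]]:
    """Return (row_number, column_count) for rows whose width differs from the first row.

    Hypothetical helper for illustration; row numbers are 1-based and count
    every record the csv reader yields, including blank lines.
    """
    ragged = []
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        expected = None
        for row_number, row in enumerate(reader, start=1):
            if not row:
                continue  # blank lines carry no columns, so skip them
            if expected is None:
                expected = len(row)  # first non-blank row defines the width
            elif len(row) != expected:
                ragged.append((row_number, len(row)))
    return ragged
```

An empty result means every row has the same number of columns as the first one; it says nothing about whether the values themselves are valid.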

I tried this with a stand-alone function as well as inside a pipeline (run with BeamDagRunner, for simplicity). The pipeline does generate a metadata.db, but I cannot find any trace of the CSV data there (like column names). Adding a StatisticsGen to the pipeline didn't help any further.
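To check whether the run wrote any data at all, independently of the component object and of metadata.db, one option is to list the files under the pipeline root. This is a minimal sketch using only the standard library; the helper name `list_pipeline_outputs` is hypothetical, and the expectation that ExampleGen output appears in per-split subdirectories (something like `CsvExampleGen/examples/<execution id>/train`) is an assumption about the on-disk layout, not something confirmed above.

```python
import os
from typing import Dict, List


def list_pipeline_outputs(pipeline_root: str) -> Dict[str, List[str]]:
    """Map each directory under pipeline_root that contains files to its sorted file names.

    Hypothetical helper for illustration; keys are paths relative to pipeline_root.
    """
    found = {}
    for directory, _subdirs, files in os.walk(pipeline_root):
        if files:
            relative = os.path.relpath(directory, pipeline_root)
            found[relative] = sorted(files)
    return found


# Example use with the PIPELINE_ROOT constant from the question:
# for directory, files in sorted(list_pipeline_outputs(PIPELINE_ROOT).items()):
#     print(directory, files)
```

If this prints record files for CsvExampleGen, the data was generated and only the way it is being inspected needs to change.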

I tried this with the iris dataset, with and without column headers. I also tried with up to 7 small, artificial CSV files within data_path, alternatively with purely numerical and mixed numerical/categorical data, and alternatively with commas and semicolons as separators. The result is always the same.

Do I have a problem with the code, or maybe with some configuration or libraries?

Here is the full code (as far as possibly relevant):

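# Imports assumed for tfx 0.24.0 (omitted in the original post)
import os
from typing import List, Optional, Text

from ml_metadata.proto import metadata_store_pb2
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import metadata, pipeline
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

PIPELINE_NAME = "X-pipeline-iris2"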
BASE_PATH = r"C:\***\FX_Experiments"
BASE_PATH_PIPELINE = os.path.join(BASE_PATH, "pipeline")
BASE_PATH_TESTS = os.path.join(BASE_PATH, "tests")
PIPELINE_ROOT = os.path.join(BASE_PATH_PIPELINE, "output")
METADATA_PATH = os.path.join(BASE_PATH_PIPELINE, "tfx_metadata", PIPELINE_NAME, "metadata.db")
DATA_PATH = os.path.join(BASE_PATH_TESTS, "iris2")
ENABLE_CACHE = True


def create_pipeline(
        pipeline_name: Text, pipeline_root: Text, data_path: Text,
        enable_cache: bool,
        metadata_connection_config: Optional[metadata_store_pb2.ConnectionConfig] = None,
        beam_pipeline_args: Optional[List[Text]] = None
):
    components = []

    example_gen = CsvExampleGen(input_base=data_path)
    components.append(example_gen)

    stat_gen = StatisticsGen(examples=example_gen.outputs['examples'])
    components.append(stat_gen)

    return pipeline.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=components,
        enable_cache=enable_cache,
        metadata_connection_config=metadata_connection_config,
        beam_pipeline_args=beam_pipeline_args
    )

def run_pipeline():
    this_pipeline = create_pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=PIPELINE_ROOT,
        data_path=DATA_PATH,
        enable_cache=ENABLE_CACHE,
        metadata_connection_config=metadata.sqlite_metadata_connection_config(METADATA_PATH)
    )
    BeamDagRunner().run(this_pipeline)

Also potentially useful: logger info:

INFO:absl:Excluding no splits because exclude_splits is not set.
INFO:absl:Component CsvExampleGen depends on [].
INFO:absl:Component CsvExampleGen is scheduled.
INFO:absl:Component StatisticsGen depends on ['Run[CsvExampleGen]'].
INFO:absl:Component StatisticsGen is scheduled.
INFO:absl:Component CsvExampleGen is running.
INFO:absl:Running driver for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:select span and version = (0, None)
INFO:absl:latest span and version = (0, None)
INFO:absl:Running publisher for CsvExampleGen
INFO:absl:MetadataStore with DB connection initialized
INFO:absl:Component CsvExampleGen is finished.
INFO:absl:Component StatisticsGen is running.
...

Felix, if you follow the guides, you are probably running your code in a notebook. If you want to see the results directly, you have to run the component interactively using InteractiveContext.

https://www.tensorflow.org/tfx/api_docs/python/tfx/orchestration/experimental/interactive/interactive_context/InteractiveContext

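from tfx.components import CsvExampleGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()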
example_gen = CsvExampleGen(input_base='/content/data')
context.run(example_gen)
