
How to parallelise a Python input pipeline in Distributed TensorFlow

I have a non-trivial input pipeline, written in Python, that reads ground truth and raw data and preprocesses them. Running the input pipeline for a single sample takes a long time, so I have multiple processes (from the Python multiprocessing package) running in parallel, plus queues, to perform the operation quickly and prefetch data. The output is then fed to my network using feed_dict. The overhead of this process in my training loop is two orders of magnitude smaller than the actual tf.Session.run() time.

I'm trying to move to the tf.data API by wrapping my read+preprocess functions with tf.py_func, but it runs slowly, probably due to the GIL, even when I increase the number of parallel calls. I want to scale my training up to multiple machines and am not sure how data fetching behaves in such a case, and there's the performance issue on a single machine as well :)
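For reference, this is roughly what my tf.py_func attempt looks like (read_and_preprocess and the sample paths below are placeholders for my actual code):

    import numpy as np
    import tensorflow as tf

    def read_and_preprocess(path):
        # Placeholder for my real Python read + preprocessing code;
        # it returns a float32 numpy array.
        return np.zeros((224, 224, 3), dtype=np.float32)

    sample_paths = ["sample_%d.bin" % i for i in range(1000)]  # placeholder file list

    dataset = tf.data.Dataset.from_tensor_slices(sample_paths)
    # Even with num_parallel_calls > 1, the Python bodies wrapped in
    # tf.py_func cannot run concurrently in one process because of the GIL.
    dataset = dataset.map(
        lambda p: tf.py_func(read_and_preprocess, [p], tf.float32),
        num_parallel_calls=8)
    dataset = dataset.prefetch(4)
    next_sample = dataset.make_one_shot_iterator().get_next()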

So, basically my question is: how do I run Python functions in a tf.data API input pipeline in parallel on multiple CPU cores?

A couple of clarifications: tf.py_func can run in parallel with your sess.run() (because sess.run() releases the GIL), but you cannot run multiple tf.py_func calls in parallel within the same Python process, because the GIL serializes them.

The usual answer in such cases is to do the preprocessing once offline, save the results to disk (e.g. in TFRecord format), and read the ready-made data from files during training. You can probably parallelize the offline preprocessing using something like multiprocessing.
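A rough sketch of that offline approach, assuming a Python function preprocess() that returns a fixed-shape float32 numpy array (all names and shapes below are hypothetical):

    import multiprocessing as mp
    import numpy as np
    import tensorflow as tf

    def preprocess(path):
        # Hypothetical stand-in for the expensive Python read + preprocessing.
        return np.zeros((224, 224, 3), dtype=np.float32)

    def write_shard(args):
        shard_path, sample_paths = args
        # Each worker process writes one TFRecord shard.
        writer = tf.python_io.TFRecordWriter(shard_path)
        for p in sample_paths:
            array = preprocess(p)
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(
                    float_list=tf.train.FloatList(value=array.ravel()))}))
            writer.write(example.SerializeToString())
        writer.close()

    if __name__ == "__main__":
        all_paths = ["sample_%d.bin" % i for i in range(1000)]
        shards = [("shard_%d.tfrecord" % i, all_paths[i::4]) for i in range(4)]
        pool = mp.Pool(4)  # one worker process per shard
        pool.map(write_shard, shards)
        pool.close()
        pool.join()

At training time you then read the shards with tf.data.TFRecordDataset and decode them with a cheap Dataset.map, so nothing expensive runs in Python anymore.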

If you can express your preprocessing using TensorFlow operations, you can run it in parallel using Dataset.map, but there is no built-in support for Python multiprocessing in tf.data. If the above does not work for some reason, you will probably have to hook up multiprocessing yourself.
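For example, if the per-sample work can be written with TensorFlow ops (the JPEG decoding below is purely illustrative), Dataset.map will spread it over multiple CPU cores:

    import tensorflow as tf

    def preprocess(path):
        # Pure-TF preprocessing runs in tf.data's own thread pool,
        # so num_parallel_calls gives real multi-core parallelism.
        image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
        image = tf.image.resize_images(image, [224, 224])
        return tf.cast(image, tf.float32) / 255.0

    dataset = (tf.data.Dataset.list_files("images/*.jpg")  # illustrative glob
               .map(preprocess, num_parallel_calls=8)
               .batch(32)
               .prefetch(1))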

One way to approach this is the following: have multiple processes produce your inputs and put them into a multiprocessing.Queue (or into shared memory with some locking around it), then implement the receiving side as a generator function and create a dataset with Dataset.from_generator.
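A minimal sketch of that approach, assuming an expensive Python function produce_sample() and fixed-shape float32 outputs (both are assumptions, adapt to your pipeline):

    import multiprocessing as mp
    import numpy as np
    import tensorflow as tf

    def produce_sample(path):
        # Hypothetical expensive Python read + preprocessing step.
        return np.zeros((224, 224, 3), dtype=np.float32)

    def worker(paths, queue):
        # Runs in a separate process, so it is not limited by the
        # GIL of the main training process.
        for p in paths:
            queue.put(produce_sample(p))

    def make_dataset(all_paths, num_workers=4):
        queue = mp.Queue(maxsize=64)
        for i in range(num_workers):
            proc = mp.Process(target=worker,
                              args=(all_paths[i::num_workers], queue))
            proc.daemon = True
            proc.start()

        def generator():
            # The generator only dequeues ready-made numpy arrays,
            # which is cheap, so the main process stays responsive.
            while True:
                yield queue.get()

        return tf.data.Dataset.from_generator(
            generator,
            output_types=tf.float32,
            output_shapes=tf.TensorShape([224, 224, 3]))

    dataset = make_dataset(["sample_%d.bin" % i for i in range(1000)]).prefetch(4)

Note that items pass through the queue by pickling, so for very large samples the shared-memory variant may be worth the extra locking effort.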

Google has recently released TensorFlow Extended (TFX). It essentially consists of:

  • A set of operators (which they call components), each of which uses Apache Beam to distribute the data processing.
  • A standardization of both data and parameter formats (defined as protobuf messages).
  • Automated dependency management of operators (workflow/orchestration)
  • Tracking of runs. This allows the system to skip operations that have already been performed under the same conditions.

I would suggest taking a look at TFX, or, for a more modest leap, at Apache Beam.

