I am working on a preprocessing task over a large amount of text data and want to load the preprocessed data into TensorFlow 2.x. The preprocessed data contains arrays of integer values, since the preprocessing step generates structures such as one-hot encoded labels and tokenized text.
My plan is to use PySpark to preprocess the data and dump the result into a JSON file (since CSV cannot store structured data). So far, everything works OK, but I am having trouble consuming the JSON file with tf.data.Dataset (or anything else that scales as efficiently and can interface with TensorFlow 2.x).
I do not want to use/install an additional library (e.g. TensorFlowOnSpark) besides TensorFlow and PySpark, so I am wondering whether it is possible to link the two in an efficient way using JSON files, since there seems to be no other way to save/load records containing a list of values(?). The JSON test file looks like this:
readDF = spark.read.format('json').option('header',True).option('sep','|').load('/output.csv')
readDF.select('label4').show(15, False)
+---------------------------------------------------------+
|label4 |
+---------------------------------------------------------+
|[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]|
|[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
+---------------------------------------------------------+
So, the label4 column has already been one-hot encoded, and the tokenized text column will look similar once the tokenizer has been applied to it. My question is: can a JSON file be loaded efficiently (maybe via a generator function) with tf.data.Dataset, or should I go down a different road (with an additional library) for this one?
The tf.data API provides several ways to efficiently consume data from different sources. And while I would say a "cleaner" solution might be to handle the preprocessing using TensorFlow itself, let me suggest a couple of ideas for your use case:
I can see that you preprocess the data and store the entire one-hot encoded vector, which penalises your data transfer: you will be reading mostly zeros rather than the actual label of interest. I would suggest encoding the label as a single integer and expanding it to a one-hot vector with a Python generator on ingestion. Alternatively, if you are using a categorical cross-entropy loss function, you can keep the label encoding (each class as an integer) and use sparse categorical cross-entropy instead.
If you already have one-hot-encoded lists, you can simply use my_list.index(1) to get the label encoding (it is the index of the only 1 in the vector, after all).
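As a minimal pure-Python sketch (the row values below are made up for illustration, in the style of your label4 column):

```python
# Illustrative one-hot rows (made-up data, 4 classes for brevity)
one_hot_rows = [
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

# The label encoding is just the position of the single 1 in each row
labels = [row.index(1) for row in one_hot_rows]
print(labels)  # [3, 1, 0]
```

These integer labels can then be fed to a sparse loss such as tf.keras.losses.SparseCategoricalCrossentropy, so the full one-hot vector never needs to be stored or transferred.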
This is totally possible using tf.data. In fact, it provides the from_generator function to wrap Python generators for ingesting data into your model. As found in the documentation, this is how you would use it:
import tensorflow as tf

def gen():
    ragged_tensor = tf.ragged.constant([[1, 2], [3]])
    yield 42, ragged_tensor

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.RaggedTensorSpec(shape=(2, None), dtype=tf.int32)))

list(dataset.take(1))
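Applied to your case: Spark's JSON writer produces one JSON object per line (the JSON Lines format), so a plain Python generator can stream records without loading the whole file into memory. This is a hedged sketch; the field names tokens and label4 are assumptions based on your example:

```python
import json

def json_line_generator(filename):
    # Spark writes its JSON output as one object per line (JSON Lines),
    # so we can parse and yield records one at a time.
    with open(filename) as f:
        for line in f:
            record = json.loads(line)
            # convert the stored one-hot list to its integer label
            yield record["tokens"], record["label4"].index(1)
```

The resulting generator can be handed to tf.data.Dataset.from_generator exactly as in the documentation snippet above, with an output_signature matching your token and label shapes.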
If you are working with massive amounts of data, you can probably work around JSON and encode the structure within a CSV-like format such as TSV. If you need a list-like column, use a secondary separator: for instance, separate columns with \t, then separate elements within each column with , or |, or whatever character causes the fewest collisions with your existing data.
As an example, let's assume your CSV file has the following structure:
column name 1, column name 2, column name 3, column name 4
0.1,0.2,0.3,0:0:0:1
0.1,0.2,0.3,0:0:1:0
0.1,0.2,0.3,0:1:0:0
...
That is, you have 4 columns separated by , and the 4th column is itself a list of values separated by :, which in turn is a one-hot representation of 4 classes. A generator that you could use with the code above is:
def my_generator(filename):
    first_line = True
    with open(filename) as f:
        for line in f:
            if first_line:
                # do something to handle the header
                first_line = False
                continue
            fields = line.strip().split(',')
            # extract the index of the one-hot encoded class;
            # the split fields are strings, so search for '1', not 1
            label = fields[3].split(':').index('1')
            fields[3] = label
            yield fields  # a list of features plus the integer class
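Finally, a hedged sketch of wiring such a generator into tf.data. The shapes assume 3 float features and one integer label, matching the example file above, and 'output.csv' is a placeholder path; adapt both to your real data:

```python
import tensorflow as tf

def csv_generator(filename):
    with open(filename) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.strip().split(',')
            features = [float(x) for x in fields[:3]]
            # the split fields are strings, so search for '1'
            label = fields[3].split(':').index('1')
            yield features, label

# from_generator is lazy: the file is only read once the dataset is iterated
dataset = tf.data.Dataset.from_generator(
    lambda: csv_generator('output.csv'),  # placeholder path
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)))
```

From here, the usual dataset transformations (batch, shuffle, prefetch) apply before passing the dataset to model.fit.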