I am working on a preprocessing task over a large amount of text data and want to load the preprocessed data into TensorFlow 2.x. The preprocessed data contains arrays of integer values, since the preprocessing step generates structures such as one-hot encoded labels and tokenized text.
My plan is to use PySpark to preprocess the data and dump the result into a JSON file (since CSV cannot store structured data). So far, everything works OK, but I am having trouble consuming the JSON file with tf.data.Dataset (or anything else that scales as efficiently and can interface with TensorFlow 2.x).
I do not want to use/install an additional library (e.g. TensorFlowOnSpark) besides TensorFlow and PySpark, so I am wondering whether it is possible to link the two in an efficient way using JSON files, since there seems to be no other way to save/load records containing a list of values(?). The JSON test file looks like this:
readDF = spark.read.format('json').option('header',True).option('sep','|').load('/output.csv')
readDF.select('label4').show(15, False)
+---------------------------------------------------------+
|label4 |
+---------------------------------------------------------+
|[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]|
|[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
+---------------------------------------------------------+
So, the label4 column has already been one-hot encoded, and the tokenized text column will look similar once the tokenizer has been applied to it. My question is: can a JSON file be loaded efficiently (maybe via a generator function) with tf.data.Dataset, or should I go down a different road (with an additional library) for this one?
The tf.data API provides several ways to efficiently consume data from different sources. And while I would say a "cleaner" solution might be to handle the preprocessing using TensorFlow itself, let me suggest a couple of ideas for your use case:
I can see that you preprocess the data and store the entire one-hot encoded vector, which penalises your data transfer: you will be reading mostly zeros rather than the actual label of interest. I would suggest encoding the label as a single integer and expanding it to a one-hot vector with a Python generator on ingestion. Alternatively, if you are using a categorical cross-entropy loss function, you can keep the label encoding (each class as an integer) and use sparse categorical cross-entropy instead.
If you already have one-hot-encoded lists, you can simply use my_list.index(1) to get the label encoding (it is the index of the only 1 in the vector, after all).
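As a minimal pure-Python sketch (the row values below are made up for illustration, in the style of your label4 column):

```python
# Illustrative one-hot rows (made-up data, 4 classes for brevity)
one_hot_rows = [
    [0, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]

# The label encoding is just the position of the single 1 in each row
labels = [row.index(1) for row in one_hot_rows]
print(labels)  # [3, 1, 0]
```

These integer labels can then be fed to a sparse loss such as tf.keras.losses.SparseCategoricalCrossentropy, so the full one-hot vector never needs to be stored or transferred.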
This is totally possible using tf.data. In fact, it provides the from_generator function to wrap Python generators for ingesting data into your model. As found in the documentation, this is how you would use it:
import tensorflow as tf

def gen():
    ragged_tensor = tf.ragged.constant([[1, 2], [3]])
    yield 42, ragged_tensor

dataset = tf.data.Dataset.from_generator(
    gen,
    output_signature=(
        tf.TensorSpec(shape=(), dtype=tf.int32),
        tf.RaggedTensorSpec(shape=(2, None), dtype=tf.int32)))

list(dataset.take(1))
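Applied to your case: Spark's JSON writer produces one JSON object per line (the JSON Lines format), so a plain Python generator can stream records without loading the whole file into memory. This is a hedged sketch; the field names tokens and label4 are assumptions based on your example:

```python
import json

def json_line_generator(filename):
    # Spark writes its JSON output as one object per line (JSON Lines),
    # so we can parse and yield records one at a time.
    with open(filename) as f:
        for line in f:
            record = json.loads(line)
            # convert the stored one-hot list to its integer label
            yield record["tokens"], record["label4"].index(1)
```

The resulting generator can be handed to tf.data.Dataset.from_generator exactly as in the documentation snippet above, with an output_signature matching your token and label shapes.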
If you are working with massive amounts of data, you can probably work around JSON and encode the structure within a CSV-like format such as TSV. If you need a list-like column, use a secondary separator: for instance, separate columns with \t, then separate elements within each column with , or |, or whatever character causes the fewest collisions with your existing data.
As an example, let's assume your CSV file has the following structure:
column name 1, column name 2, column name 3, column name 4
0.1,0.2,0.3,0:0:0:1
0.1,0.2,0.3,0:0:1:0
0.1,0.2,0.3,0:1:0:0
...
That is, you have 4 columns separated by , and the 4th column is itself a list of values separated by :, which in turn is a one-hot representation of 4 classes. A generator that you could use with the code above is:
def my_generator(filename):
    first_line = True
    with open(filename) as f:
        for line in f:
            if first_line:
                # do something to handle the header
                first_line = False
                continue
            fields = line.strip().split(',')
            # extract the index of the one-hot encoded class;
            # the split fields are strings, so search for '1', not 1
            label = fields[3].split(':').index('1')
            fields[3] = label
            yield fields  # a list of features plus the integer class
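Finally, a hedged sketch of wiring such a generator into tf.data. The shapes assume 3 float features and one integer label, matching the example file above, and 'output.csv' is a placeholder path; adapt both to your real data:

```python
import tensorflow as tf

def csv_generator(filename):
    with open(filename) as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.strip().split(',')
            features = [float(x) for x in fields[:3]]
            # the split fields are strings, so search for '1'
            label = fields[3].split(':').index('1')
            yield features, label

# from_generator is lazy: the file is only read once the dataset is iterated
dataset = tf.data.Dataset.from_generator(
    lambda: csv_generator('output.csv'),  # placeholder path
    output_signature=(
        tf.TensorSpec(shape=(3,), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32)))
```

From here, the usual dataset transformations (batch, shuffle, prefetch) apply before passing the dataset to model.fit.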