
How to efficiently feed data into TensorFlow 2.x

I am working on a preprocessing task for a large amount of text data and want to load the preprocessed data into TensorFlow 2.x. The preprocessed data contains arrays of integer values, since the preprocessing step generates:

  • a one-hot encoded array as the label column
  • a tokenized list of tokens per data row
  • an activation mask for use in transformers

So, I've been thinking I'll use PySpark to preprocess the data and dump the result into a JSON file (since CSV cannot store structured data). So far, everything works out OK. But I am having trouble processing the JSON file with tf.data.Dataset (or anything else that scales as efficiently and can interface with TensorFlow 2.x).

I do not want to use/install an additional library (e.g. TensorFlowOnSpark) besides TensorFlow and PySpark, so I am wondering whether it's possible to link the two in an efficient way using JSON files, since there seems to be no other way of saving/loading records containing a list of data(?). The JSON test file looks like this:

readDF = spark.read.format('json').option('header',True).option('sep','|').load('/output.csv')
readDF.select('label4').show(15, False)

+---------------------------------------------------------+
|label4                                                   |
+---------------------------------------------------------+
|[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]|
|[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]|
|[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]|
+---------------------------------------------------------+

So, the label4 column has already been one-hot encoded, and the tokenized text column will look similar once the tokenizer has been applied to it. So, my question is: can a JSON file be loaded efficiently (maybe via a generator function) with tf.data.Dataset, or should I go down a different road (with an additional library) for this one?

The tf.data API provides several ways to efficiently consume data from different sources. And while I would say a "cleaner" solution might be to handle the preprocessing using TensorFlow itself, let me suggest a couple of ideas for your use case:

1) one-hot encoding

I can see that you preprocess the data and store the entire one-hot encoded vector, which will penalise your data transfer, since you will be reading mostly zeros as opposed to the actual label of interest. I would suggest encoding the label as an integer instead, and transforming it into a one-hot encoding with a Python generator on ingestion. Alternatively, if you're using a categorical cross-entropy loss function, you can keep the label encoding (each class encoded as an integer) and use the sparse categorical cross-entropy loss instead.

If you already have one-hot encoded lists, you can simply use my_list.index(1) to get the label encoding (it is, after all, the same as the index of the only 1 in the vector).
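As a small, pure-Python sketch of that round trip (the `to_one_hot` helper name is ours, just for illustration):

```python
# A one-hot encoded label row, as in the label4 column above
one_hot = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Label encoding: the index of the single 1
label = one_hot.index(1)
print(label)  # 7

# Re-expanding to a one-hot vector on ingestion
def to_one_hot(index, num_classes):
    vec = [0] * num_classes
    vec[index] = 1
    return vec

assert to_one_hot(label, len(one_hot)) == one_hot
```

With sparse categorical cross-entropy you would skip the re-expansion entirely and feed the integer label directly.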

2) using a generator

This is entirely possible using tf.data. In fact, it provides the from_generator function to wrap Python generators for ingesting data into your model. As shown in the documentation, this is how you would use it:

import tensorflow as tf

def gen():
  ragged_tensor = tf.ragged.constant([[1, 2], [3]])
  yield 42, ragged_tensor

# each element yielded by the generator must match the output_signature
dataset = tf.data.Dataset.from_generator(
     gen,
     output_signature=(
         tf.TensorSpec(shape=(), dtype=tf.int32),
         tf.RaggedTensorSpec(shape=(2, None), dtype=tf.int32)))

list(dataset.take(1))
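Applied to the question above, the same pattern could look like the following sketch (pure Python; it assumes Spark wrote the records as JSON Lines, one object per line, which is what DataFrame.write.json produces — 'label4' comes from the question, while 'tokens' is a placeholder name for the tokenized text column):

```python
import json
import os
import tempfile

def json_record_generator(path):
    # Stream one record at a time instead of loading the whole file;
    # each line of a Spark JSON dump is a self-contained JSON object.
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # 'tokens' is a placeholder field name for the tokenized text
            yield record['tokens'], record['label4']

# Tiny demonstration with a throwaway JSON Lines file
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as f:
    f.write(json.dumps({'tokens': [5, 9, 2], 'label4': [0, 1, 0]}) + '\n')
    f.write(json.dumps({'tokens': [7, 1], 'label4': [1, 0, 0]}) + '\n')
    path = f.name

records = list(json_record_generator(path))
os.unlink(path)
print(records[0])  # ([5, 9, 2], [0, 1, 0])
```

Such a generator can then be wrapped with tf.data.Dataset.from_generator exactly as in the snippet above, using a RaggedTensorSpec for the variable-length token lists.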

3) consider going back to CSV

If you're working with massive amounts of data, you can probably work around JSON encodings and encode some structure within CSV-like formats, such as TSV. If you need a list-like column, you can use other separators: for instance, you can separate columns by \t, and then separate elements within each column using , or | or whatever character causes fewer collisions with your existing data.

As an example, let's assume your CSV file has the following structure:

column name 1, column name 2, column name 3, column name 4
0.1,0.2,0.3,0:0:0:1
0.1,0.2,0.3,0:0:1:0
0.1,0.2,0.3,0:1:0:0
...

That is, you have 4 columns separated by , and the 4th column is itself a list of values separated by :, which is also a one-hot representation of 4 classes. A generator that you could use with the code above is:

def my_generator(filename):
    first_line = True
    with open(filename) as f:
        for line in f:
            if first_line:
                # do something to handle the header
                first_line = False
                continue
            fields = line.rstrip('\n').split(',')
            # here you extract the index of the one-hot encoded class;
            # note the split elements are strings, so look for '1', not 1
            label = fields[3].split(':').index('1')
            fields[3] = label
            yield fields  # return a list of features and the class
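A quick way to sanity-check that parsing logic end to end, with a throwaway file mirroring the structure above (the `parse_line` helper is just an illustrative re-statement of the per-line logic):

```python
import os
import tempfile

def parse_line(line):
    # Split the comma-separated columns, then decode the ':'-separated
    # one-hot field into the index of its single '1'
    fields = line.rstrip('\n').split(',')
    label = fields[3].split(':').index('1')
    return [float(v) for v in fields[:3]], label

with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as f:
    f.write('column name 1,column name 2,column name 3,column name 4\n')
    f.write('0.1,0.2,0.3,0:0:0:1\n')
    f.write('0.1,0.2,0.3,0:0:1:0\n')
    path = f.name

with open(path) as f:
    next(f)  # skip the header row
    rows = [parse_line(line) for line in f]
os.unlink(path)

print(rows[0])  # ([0.1, 0.2, 0.3], 3)
```

The integer labels this produces are exactly what the sparse categorical cross-entropy loss from idea 1) expects.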
