How to create a sliding window after applying Text Vectorization on tf Datasets?

I am reading a large text file using TensorFlow's TextLineDataset. I want to tokenize the dataset, create a sliding window over the tokens, and split each window into two parts: input and label. If the text file contains the following text:

Lorem ipsum dolor sit amet...

then I want to create sequences of a specified length, pre-padded with 0s. I want to iterate over the text and use all tokens but the last as input and the last one as the label. So my goal is to first tokenize the text into something like this:

Lorem: 1,
ipsum: 2,
dolor: 3,
sit: 4,
amet: 5,

Then create sequences of, let's say, length 5 like this to train a model:

X_train = [[0, 0, 0, 0, 1], [0, 0, 0, 1, 2], [0, 0, 1, 2, 3], ...]
y_train = [2, 3, 4, ...] # next word of the sequence in X_train
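As a plain-Python reference for this target layout (not a TensorFlow solution), the sketch below produces the pre-padded windows lazily, one pair at a time. The function name `prepadded_windows` and the choice of 0 as the padding id are assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def prepadded_windows(token_ids: Iterable[int], length: int,
                      pad_id: int = 0) -> Iterator[Tuple[List[int], int]]:
    """Yield (input, label) pairs where each input is pre-padded to `length`."""
    # A bounded deque keeps only the most recent `length` tokens in memory.
    window = deque([pad_id] * length, maxlen=length)
    first = True
    for token in token_ids:
        if not first:
            # The current token is the label for the tokens seen before it.
            yield list(window), token
        window.append(token)  # the oldest entry drops off the left
        first = False


pairs = list(prepadded_windows([1, 2, 3, 4, 5], length=5))
X_train = [x for x, _ in pairs]
y_train = [y for _, y in pairs]
print(X_train)
print(y_train)
```

Because the function is a generator, it never holds more than one window at a time, which is the property needed for a large file.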

I am using TextVectorization to tokenize, but I cannot figure out an efficient way to create the inputs and labels for a large dataset.

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int')
train_data = train_data.map(vectorize_layer)

Using a for loop over the dataset makes the device run out of memory while trying to allocate a large amount of it. What is the best way to do this?

You could use the sliding_window function from tensorflow-text; however, the TextVectorization layer seems to only apply post-padding:

import tensorflow as tf
import tensorflow_text as tft

with open('data.txt', 'w') as f:
  f.write('Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam efficitur viverra lacus?\n')

train_data = tf.data.TextLineDataset(['data.txt'])  # same file written above

vectorize_layer = tf.keras.layers.TextVectorization(output_mode='int', max_tokens=50, pad_to_max_tokens=True)
vectorize_layer.adapt(train_data)  # build the vocabulary before encoding

window_size = 5

def sliding_window(x):
  encoded = vectorize_layer(x)
  x = tft.sliding_window(encoded, width=window_size, axis=0)
  y = tft.sliding_window(encoded, width=window_size + 1, axis=0)[:, -1]
  return x[:tf.shape(y)[0],:], y

train_data = train_data.map(sliding_window)

vocab = tf.constant(vectorize_layer.get_vocabulary())
keys = tf.cast(tf.range(vocab.shape[0]), tf.int64)
table = tf.lookup.StaticHashTable(
    tf.lookup.KeyValueTensorInitializer(keys, vocab),
    default_value="")

train_data = tf.data.Dataset.zip((train_data.map(lambda x, y: x).flat_map(tf.data.Dataset.from_tensor_slices),
                                 train_data.map(lambda x, y: y).flat_map(tf.data.Dataset.from_tensor_slices)))

for x, y in train_data:
  print('x -->', x, 'y -->', y)
  print('x -->', table.lookup(x), 'y -->', table.lookup(y), '\n')

x --> tf.Tensor([ 4  6  9  3 11], shape=(5,), dtype=int64) y --> tf.Tensor(10, shape=(), dtype=int64)
x --> tf.Tensor([b'lorem' b'ipsum' b'dolor' b'sit' b'amet'], shape=(5,), dtype=string) y --> tf.Tensor(b'consectetur', shape=(), dtype=string) 

x --> tf.Tensor([ 6  9  3 11 10], shape=(5,), dtype=int64) y --> tf.Tensor(13, shape=(), dtype=int64)
x --> tf.Tensor([b'ipsum' b'dolor' b'sit' b'amet' b'consectetur'], shape=(5,), dtype=string) y --> tf.Tensor(b'adipiscing', shape=(), dtype=string) 

x --> tf.Tensor([ 9  3 11 10 13], shape=(5,), dtype=int64) y --> tf.Tensor(7, shape=(), dtype=int64)
x --> tf.Tensor([b'dolor' b'sit' b'amet' b'consectetur' b'adipiscing'], shape=(5,), dtype=string) y --> tf.Tensor(b'elit', shape=(), dtype=string) 

x --> tf.Tensor([ 3 11 10 13  7], shape=(5,), dtype=int64) y --> tf.Tensor(12, shape=(), dtype=int64)
x --> tf.Tensor([b'sit' b'amet' b'consectetur' b'adipiscing' b'elit'], shape=(5,), dtype=string) y --> tf.Tensor(b'aliquam', shape=(), dtype=string) 

x --> tf.Tensor([11 10 13  7 12], shape=(5,), dtype=int64) y --> tf.Tensor(8, shape=(), dtype=int64)
x --> tf.Tensor([b'amet' b'consectetur' b'adipiscing' b'elit' b'aliquam'], shape=(5,), dtype=string) y --> tf.Tensor(b'efficitur', shape=(), dtype=string) 

x --> tf.Tensor([10 13  7 12  8], shape=(5,), dtype=int64) y --> tf.Tensor(2, shape=(), dtype=int64)
x --> tf.Tensor([b'consectetur' b'adipiscing' b'elit' b'aliquam' b'efficitur'], shape=(5,), dtype=string) y --> tf.Tensor(b'viverra', shape=(), dtype=string) 

x --> tf.Tensor([13  7 12  8  2], shape=(5,), dtype=int64) y --> tf.Tensor(5, shape=(), dtype=int64)
x --> tf.Tensor([b'adipiscing' b'elit' b'aliquam' b'efficitur' b'viverra'], shape=(5,), dtype=string) y --> tf.Tensor(b'lacus', shape=(), dtype=string) 

Note that sequences that do not have a corresponding label are discarded by the expression x[:tf.shape(y)[0],:]. Also, the lookup table is only for demonstration purposes and is not needed to achieve what you want. You can look at tft.pad_along_dimension if you want to apply pre-padding.
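To see why the trimming is needed, the following plain-Python sketch mirrors the windowing arithmetic without TensorFlow. The helper `windows` is hypothetical and only imitates a stride-1 sliding window; the token ids are the ones printed in the output above.

```python
from typing import List


def windows(seq: List[int], width: int) -> List[List[int]]:
    """Return every contiguous slice of `seq` with the given width (stride 1)."""
    return [seq[i:i + width] for i in range(len(seq) - width + 1)]


tokens = [4, 6, 9, 3, 11, 10, 13, 7, 12, 8, 2, 5]  # the 12 encoded words
window_size = 5

x = windows(tokens, window_size)                       # 8 windows of width 5
y = [w[-1] for w in windows(tokens, window_size + 1)]  # 7 labels
x = x[:len(y)]  # drop the final window, which has no following token

print(len(x), len(y))
print(x[-1], y[-1])
```

A sequence of n tokens yields n - 5 + 1 windows of width 5 but only n - 6 + 1 windows of width 6, so there is always exactly one input window left without a label.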
