Batching in tf.data.Dataset for time series analysis

Question

I am considering building a pipeline for a time series LSTM model. I have two input sources; let's call them series1 and series2.

I initialize the tf.data object by calling from_tensor_slices:

ds = tf.data.Dataset.from_tensor_slices((series1, series2))

I then split them into windows of a fixed size, shifting by 1 between consecutive windows:

ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

At this point I want to control how they are batched together. As an example, I would like to produce output like the following:

series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]

batch 1: [1, 2, 100, 200]
batch 2: [2, 3, 200, 300]
batch 3: [3, 4, 300, 400]

So each batch would return two elements of series1 followed by two elements of series2. This snippet, which tries to batch them separately, does not work:

ds = ds.map(lambda s1, s2: (s1.batch(window_size + 1), s2.batch(window_size + 1)))

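because it returns a mapping of two dataset objects. Since these are dataset objects rather than tensors, they are not subscriptable, so this does not work either: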

ds = ds.map(lambda s1, s2: (s1[:2], s2[:2]))
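To see why subscripting fails, it can help to inspect what window() actually yields. The sketch below uses the toy values from the example above and assumes window_size = 2 (the question does not state it). Each element is a pair of nested datasets, which have to be iterated or batched before they behave like tensors.

```python
import tensorflow as tf

# Toy values from the question's example; window_size = 2 is an assumption.
series1 = [1, 2, 3, 4, 5]
series2 = [100, 200, 300, 400, 500]
window_size = 2

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Each element of ds is a tuple of two nested datasets, not a tuple of tensors,
# so the values are only reachable by iterating (or batching) each one.
windows = [
    (
        [int(v) for v in w1.as_numpy_iterator()],
        [int(v) for v in w2.as_numpy_iterator()],
    )
    for w1, w2 in ds
]

for w1, w2 in windows:
    print(w1, w2)
```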

I am sure the solution involves .apply with a custom lambda function. Any help is much appreciated.

Edit

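I am also looking at producing a label that represents the next element of the series. For example, the batches would produce the following: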

batch 1: (tf.tensor([1, 2, 100, 200]), tf.tensor([3]))
batch 2: (tf.tensor([2, 3, 200, 300]), tf.tensor([4]))
batch 3: (tf.tensor([3, 4, 300, 400]), tf.tensor([5]))

where [3], [4] and [5] are the next elements of series1 to be predicted.

Answer 1

The solution is to window the two datasets separately, .zip() them together, then use tf.concat to join the elements while carrying the label along.

ds = tf.data.Dataset.from_tensor_slices(series1)
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
ds = ds.flat_map(lambda window: window.batch(window_size + 1))
ds = ds.map(lambda window: (window[:-1], window[-1]))

ds2 = tf.data.Dataset.from_tensor_slices(series2)
ds2 = ds2.window(window_size, shift=1, drop_remainder=True)
ds2 = ds2.flat_map(lambda window: window.batch(window_size))

ds = tf.data.Dataset.zip((ds, ds2))
ds = ds.map(lambda i, j: (tf.concat([i[0], j], axis=0), i[-1]))

Returns:

(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  1,   2,   3, 100, 200, 300])>, <tf.Tensor: shape=(), dtype=int32, numpy=4>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  2,   3,   4, 200, 300, 400])>, <tf.Tensor: shape=(), dtype=int32, numpy=5>)
(<tf.Tensor: shape=(6,), dtype=int32, numpy=array([  3,   4,   5, 300, 400, 500])>, <tf.Tensor: shape=(), dtype=int32, numpy=6>)
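The same result can also be sketched as a single pipeline that windows both series together and flattens each pair of nested datasets with flat_map and Dataset.zip. This is an alternative arrangement, not the code from the answer. The inputs (series1 = 1..6, series2 = 100..600, window_size = 3) are assumptions chosen to be consistent with the output shown above.

```python
import tensorflow as tf

# Assumed inputs, chosen to be consistent with the output shown above.
series1 = tf.range(1, 7)            # 1 .. 6
series2 = tf.range(100, 700, 100)   # 100 .. 600
window_size = 3

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
ds = ds.window(window_size + 1, shift=1, drop_remainder=True)

# Turn each nested dataset into a single tensor, then zip the pair back together.
ds = ds.flat_map(
    lambda w1, w2: tf.data.Dataset.zip(
        (w1.batch(window_size + 1), w2.batch(window_size + 1))
    )
)

# Features: the first window_size values of both series.
# Label: the next value of series1.
ds = ds.map(lambda w1, w2: (tf.concat([w1[:-1], w2[:-1]], axis=0), w1[-1]))

rows = [(x.numpy().tolist(), int(y.numpy())) for x, y in ds]
for features, label in rows:
    print(features, label)
```

Windowing both series in one dataset keeps them aligned by construction, so there is no need to size the two window() calls differently.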

Answer 2

I think this is the line you are missing:

ds = ds.batch(2).map(lambda x, y: (tf.concat([x, y], axis=0)))

Full example:

import tensorflow as tf

series1 = tf.range(1, 16)
series2 = tf.range(100, 1600, 100)

ds = tf.data.Dataset.from_tensor_slices((series1, series2))

ds = ds.batch(2).map(lambda x, y: (tf.concat([x, y], axis=0)))

for row in ds:
    print(row)

tf.Tensor([  1   2 100 200], shape=(4,), dtype=int32)
tf.Tensor([  3   4 300 400], shape=(4,), dtype=int32)
tf.Tensor([  5   6 500 600], shape=(4,), dtype=int32)
tf.Tensor([  7   8 700 800], shape=(4,), dtype=int32)
tf.Tensor([   9   10  900 1000], shape=(4,), dtype=int32)
tf.Tensor([  11   12 1100 1200], shape=(4,), dtype=int32)
tf.Tensor([  13   14 1300 1400], shape=(4,), dtype=int32)
tf.Tensor([  15 1500], shape=(2,), dtype=int32)
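With 15 elements in each series, batch(2) leaves one element per series over for a final, shorter batch. A minimal sketch of one way to keep every row the same length, assuming it is acceptable to drop that last pair, is to pass drop_remainder=True:

```python
import tensorflow as tf

series1 = tf.range(1, 16)
series2 = tf.range(100, 1600, 100)

ds = tf.data.Dataset.from_tensor_slices((series1, series2))
# drop_remainder=True discards the final batch, which would hold a single pair.
ds = ds.batch(2, drop_remainder=True)
ds = ds.map(lambda x, y: tf.concat([x, y], axis=0))

rows = [row.numpy().tolist() for row in ds]
print(len(rows), rows[0], rows[-1])
```

Note that batch(2) groups non-overlapping pairs. The shift-by-one batches in the question's example would need window(..., shift=1) instead.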

Answer 3

This is the solution I use when working with time series data.

dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
dataset = dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))
dataset = dataset.batch(batch_size).prefetch(1)

The following line is the important one: it splits each window into xs and ys.

dataset.shuffle(shuffle_buffer).map(lambda window: (window[:-1], window[-1]))

The shuffle is not essential; the map function alone is enough to split each window into xs and ys.
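For reference, a self-contained sketch of the same pipeline without the shuffle, so the order is deterministic. The values of series, window_size and batch_size are hypothetical, since the answer does not specify them.

```python
import tensorflow as tf

# Hypothetical values for illustration; the answer does not specify them.
series = tf.range(10)   # 0 .. 9
window_size = 3
batch_size = 2

dataset = tf.data.Dataset.from_tensor_slices(series)
dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
# Flatten each nested window dataset into one tensor of window_size + 1 values.
dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
# Split each window: all but the last value are xs, the last value is ys.
dataset = dataset.map(lambda window: (window[:-1], window[-1]))
dataset = dataset.batch(batch_size).prefetch(1)

batches = [(xs.numpy().tolist(), ys.numpy().tolist()) for xs, ys in dataset]
for xs, ys in batches:
    print(xs, ys)
```

Ten values with windows of four give seven windows, so the last batch here holds a single example; batch(batch_size, drop_remainder=True) would drop it if uniform batch shapes are needed.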
