TensorFlow dataset with multi-dimensional Tensors from a CSV file

Is there a way to load a TensorFlow dataset with a multi-dimensional feature Tensor from a CSV (or other format) input file, and if so, how?

For example, my CSV input looks like the following:

f1,  f2,  f3,                      label
0.1, 0.2, 0.1;0.2;0.3;1.1;1.2;1.3, 1
0.2, 0.3, 0.2;0.3;0.4;1.2;1.3;1.4, 0
0.3, 0.4, 0.3;0.4;0.5;1.3;1.4;1.5, 1

I'd like to load a dataset from such a file, e.g.

import tensorflow as tf

frames_csv_ds = tf.data.experimental.make_csv_dataset(
    'input.csv',
    header=False,
    column_names=['f1','f2','f3','label'],
    batch_size=5,
    label_name='label',
    num_epochs=1,
    ignore_errors=True,)

for batch, label in frames_csv_ds.take(1):
  for key, value in batch.items():
    print(f"{key:20s}: {value}")
  print()
  print(f"{'label':20s}: {label}")

and get the batch as:

f1 : [0.1   0.2   0.3  ]
f2 : [0.2   0.3   0.4  ]
f3 : [ [[0.1, 0.2, 0.3], [1.1, 1.2, 1.3]], [[0.2, 0.3, 0.4], [1.2, 1.3, 1.4]], [[0.3, 0.4, 0.5], [1.3, 1.4, 1.5]] ]
label : [1, 0, 1]

The snippet above is incomplete and doesn't work. Is there a way to get the dataset in the illustrated form? If yes, can this be done for arrays whose dimensions vary across the dataset?

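You can do this with a custom parsing function: read the file line by line with tf.data.TextLineDataset, decode each line as CSV with every column read as a string, then split the f3 column on ";" and convert the pieces to numbers.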

import tensorflow as tf

file_path = "data.csv"
dataset = tf.data.TextLineDataset(file_path).skip(1)

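def parse_csv_line(line):
  # Decode one CSV line into four string fields (one per column)
  fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)

  # Scalar columns: convert the string directly to a number
  f1 = tf.strings.to_number(fields[0], tf.float32)
  f2 = tf.strings.to_number(fields[1], tf.float32)
  # f3 holds several values separated by ";": split, then convert to a 1-D float tensor
  f3 = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
  label = tf.strings.to_number(fields[3], tf.int32)

  return {"f1": f1, "f2": f2, "f3": f3, "label": label}

# Parse every line, then group up to 5 rows per batch (the sample file has only 3)
dataset = dataset.map(parse_csv_line).batch(5)
# Fetch the first batch; its contents are shown below
next(iter(dataset.take(1)))
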
{'f1': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.1, 0.2, 0.3], dtype=float32)>,
 'f2': <tf.Tensor: shape=(3,), dtype=float32, numpy=array([0.2, 0.3, 0.4], dtype=float32)>,
 'f3': <tf.Tensor: shape=(3, 6), dtype=float32, numpy=
 array([[0.1, 0.2, 0.3, 1.1, 1.2, 1.3],
        [0.2, 0.3, 0.4, 1.2, 1.3, 1.4],
        [0.3, 0.4, 0.5, 1.3, 1.4, 1.5]], dtype=float32)>,
 'label': <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 0, 1], dtype=int32)>}
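
Note that f3 comes back flat here, with shape (3, 6), rather than in the nested (3, 2, 3) form shown in the question. The following is a minimal sketch of one way to get the nested form and to cope with rows of different lengths, not a definitive implementation. It assumes each inner row of f3 has a fixed width of 3 (the hypothetical constant INNER_DIM), that padding shorter rows with zeros is acceptable, and it uses made-up sample data in which the last row carries only 3 values. It also returns the label separately as a (features, label) pair.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical sample data: the last row has 3 values in f3 instead of 6,
# to illustrate rows of differing length.
CSV_TEXT = """f1,f2,f3,label
0.1,0.2,0.1;0.2;0.3;1.1;1.2;1.3,1
0.2,0.3,0.2;0.3;0.4;1.2;1.3;1.4,0
0.3,0.4,0.3;0.4;0.5,1
"""

INNER_DIM = 3  # assumption: every inner row of f3 holds 3 values


def parse_csv_line(line):
  # Read every column as a string so f3 can be split manually.
  fields = tf.io.decode_csv(line, record_defaults=[[""]] * 4)
  # Strip whitespace so files with spaces after the commas also parse.
  fields = [tf.strings.strip(field) for field in fields]

  f1 = tf.strings.to_number(fields[0], tf.float32)
  f2 = tf.strings.to_number(fields[1], tf.float32)
  flat = tf.strings.to_number(tf.strings.split(fields[2], ";"), tf.float32)
  # Fold the flat vector into rows of INNER_DIM values: shape (rows, INNER_DIM).
  f3 = tf.reshape(flat, [-1, INNER_DIM])
  label = tf.strings.to_number(fields[3], tf.int32)

  return {"f1": f1, "f2": f2, "f3": f3}, label


def load_dataset(path, batch_size):
  lines = tf.data.TextLineDataset(path).skip(1)  # skip the header row
  # padded_batch pads each component with zeros up to the largest shape in the batch.
  return lines.map(parse_csv_line).padded_batch(batch_size)


with tempfile.TemporaryDirectory() as tmp_dir:
  path = os.path.join(tmp_dir, "data.csv")
  with open(path, "w") as f:
    f.write(CSV_TEXT)
  features, label = next(iter(load_dataset(path, batch_size=3)))

print(features["f3"].shape)  # (3, 2, 3)
print(features["f3"][2].numpy())  # second inner row of the short example is zero padding
print(label.numpy())  # [1 0 1]
```

Zero padding keeps every batch a dense tensor, which most Keras layers expect, but the padded values are indistinguishable from real zeros unless you also carry a mask or the original lengths. If that matters, batching into ragged tensors (for example with tf.data.experimental.dense_to_ragged_batch) may be a better fit.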
