
How to load a big dataset from CSV into Keras

I'm trying to use Keras with TensorFlow to train a network based on the SURF features that I obtained from several images. I have all these features stored in a CSV file that has the following columns:

 [ID, Code, PointX, PointY, Desc1, ..., Desc64]

The "ID" column is an autoincremental index created by pandas when I store all the values. “ID”列是我存储所有值时由熊猫创建的自动增量索引。 The "Code" column is the label of the point, this would be just a number that I got by pairing the actual code (which is a string) with a number. “代码”列是点的标签,这只是我通过将实际代码(它是一个字符串)与一个数字配对得到的一个数字。 "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point. “PointX/Y”是在给定类的图像中找到的点的坐标,“Desc#”是该点对应描述符的浮点值。

The CSV file contains all the KeyPoints and Descriptors found in all 20,000 images. This gives a total size of almost 60 GB on disk, which I obviously can't fit into memory.

I've been trying to load batches of the file using pandas, putting all the values in a numpy array and then fitting my model (a Sequential model of only 3 layers). I've used the following code to do so:

import numpy as np
import pandas as pd

# "model" is the 3-layer Sequential model described above
chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
    dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
    print(dataset_chunk)
    # Divide chunk into data and labels
    X = dataset_chunk[:, 9:]
    Y = dataset_chunk[:, 1]
    # Train model
    model.fit(x=X, y=Y, batch_size=200, epochs=20)
    # Evaluate model
    scores = model.evaluate(X, Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

This works fine for the first chunk loaded, but when the loop fetches another chunk, accuracy and loss get stuck at 0.

Is the way I'm trying to load all this information wrong?

Thanks in advance!
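
For reference, a common restructuring of this loop puts the epochs on the outside, so that each epoch makes one full pass over the file instead of running 20 epochs on each chunk in isolation. A minimal sketch, assuming the same model and column layout as above:

import numpy as np
import pandas as pd

# "model" is assumed to be the 3-layer Sequential model from the question
chunksize = 10 ** 6
for epoch in range(20):
    # One pass over the whole CSV per epoch, one chunk at a time
    for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
        dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
        X = dataset_chunk[:, 9:]
        Y = dataset_chunk[:, 1]
        model.fit(x=X, y=Y, batch_size=200, epochs=1, verbose=0)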

------ EDIT ------

OK, now I've made a simple generator like this:

import numpy as np

def read_csv(filename):
    with open(filename, 'r') as f:
        # Iterate over the file lazily instead of using f.readlines(),
        # which would load the whole file into memory at once
        for line in f:
            record = line.rstrip().split(',')
            features = [np.float32(n) for n in record[9:73]]
            label = int(record[1])
            print("features: ", type(features[0]), " ", type(label))
            yield np.array(features), label

and used fit_generator with it:

tf_ds = read_csv("mini_surf_kps.csv")
model.fit_generator(tf_ds, steps_per_epoch=1000, epochs=20)

I don't know why, but I keep getting an error just before the first epoch starts:

ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)

The first layer of the model has input_dim=64, and the shape of the features array yielded by the generator is also 64.
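
One likely cause is the missing batch dimension: a generator passed to fit_generator is expected to yield batches, i.e. arrays of shape (batch, 64) and (batch,), rather than single samples. A minimal sketch of a batched variant (read_csv_batched is a hypothetical rename, yielding batches of one):

import numpy as np

def read_csv_batched(filename):
    with open(filename, 'r') as f:
        for line in f:
            record = line.rstrip().split(',')
            features = np.array([np.float32(n) for n in record[9:73]])
            label = int(record[1])
            # Add a leading batch axis: shapes become (1, 64) and (1,)
            yield features[np.newaxis, :], np.array([label])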

If you are using TF 2.0, you can verify whether the contents of the dataset are right. You can simply do this with:

print(next(iter(tf_ds)))

to see the first element of the dataset and check whether it matches the input expected by the model.
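
Along the same lines, a sketch of wrapping the read_csv generator from the question in a tf.data.Dataset (assuming TF 2.0), which also takes care of the batching that fit_generator was missing:

import tensorflow as tf

# Wrap the plain Python generator in a tf.data pipeline; TF then
# batches the samples, so the model receives inputs of shape (batch, 64)
tf_ds = tf.data.Dataset.from_generator(
    lambda: read_csv("mini_surf_kps.csv"),
    output_types=(tf.float32, tf.int32),
    output_shapes=((64,), ()),
).batch(200)

print(next(iter(tf_ds)))  # first element: a (features, labels) batch

model.fit(tf_ds, epochs=20)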

There's a great tutorial about this, visit:

Simply:

import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
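
Note that numpy.loadtxt reads the whole file into memory, so on the 60 GB file from the question it would have to be combined with bounded reads. A sketch using the max_rows parameter (available since NumPy 1.16) to load only the first million rows:

import numpy
filename = 'surf_kps.csv'
with open(filename, 'rt') as raw_data:
    # Read at most 10**6 rows instead of the whole file
    data = numpy.loadtxt(raw_data, delimiter=",", max_rows=10 ** 6)
print(data.shape)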
