
How to load a big dataset from CSV into Keras

I'm trying to use Keras with TensorFlow to train a network based on the SURF features that I obtained from several images. I have all these features stored in a CSV file that has the following columns:

 [ID, Code, PointX, PointY, Desc1, ..., Desc64]

The "ID" column is an autoincremental index created by pandas when I store all the values. “ID”列是我存储所有值时由熊猫创建的自动增量索引。 The "Code" column is the label of the point, this would be just a number that I got by pairing the actual code (which is a string) with a number. “代码”列是点的标签,这只是我通过将实际代码(它是一个字符串)与一个数字配对得到的一个数字。 "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point. “PointX/Y”是在给定类的图像中找到的点的坐标,“Desc#”是该点对应描述符的浮点值。

The CSV file contains all the KeyPoints and Descriptors found in all 20,000 images. This gives a total size of almost 60 GB on disk, which I obviously can't fit into memory.

I've been trying to load batches of the file using pandas, putting all the values in a numpy array and then fitting my model (a Sequential model of only 3 layers). I've used the following code to do so:

import numpy as np
import pandas as pd

# "model" is the 3-layer Sequential model described above
chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
    dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
    print(dataset_chunk)
    # Divide chunk into data and labels
    X = dataset_chunk[:, 9:]
    Y = dataset_chunk[:, 1]
    # Train model
    model.fit(x=X, y=Y, batch_size=200, epochs=20)
    # Evaluate model
    scores = model.evaluate(X, Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))

This works fine for the first chunk loaded, but when the loop fetches another chunk, accuracy and loss get stuck at 0.

Is the way I'm trying to load all this information wrong?

Thanks in advance!
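
For reference, a common restructuring of this loop puts the epochs on the outside, so that each epoch makes one full pass over the file instead of running 20 epochs on each chunk in isolation. A minimal sketch, assuming the same model and column layout as above:

import numpy as np
import pandas as pd

# "model" is assumed to be the 3-layer Sequential model from the question
chunksize = 10 ** 6
for epoch in range(20):
    # One pass over the whole CSV per epoch, one chunk at a time
    for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
        dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
        X = dataset_chunk[:, 9:]
        Y = dataset_chunk[:, 1]
        model.fit(x=X, y=Y, batch_size=200, epochs=1, verbose=0)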

------ EDIT ------

OK, now I've made a simple generator like this:

import numpy as np

def read_csv(filename):
    with open(filename, 'r') as f:
        # Iterate over the file lazily instead of using f.readlines(),
        # which would load the whole file into memory at once
        for line in f:
            record = line.rstrip().split(',')
            features = [np.float32(n) for n in record[9:73]]
            label = int(record[1])
            print("features: ", type(features[0]), " ", type(label))
            yield np.array(features), label

and used fit_generator with it:

tf_ds = read_csv("mini_surf_kps.csv")
model.fit_generator(tf_ds, steps_per_epoch=1000, epochs=20)

I don't know why, but I keep getting an error just before the first epoch starts:

ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)

The first layer of the model has input_dim=64, and the shape of the features array yielded by the generator is also 64.
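
One likely cause is the missing batch dimension: a generator passed to fit_generator is expected to yield batches, i.e. arrays of shape (batch, 64) and (batch,), rather than single samples. A minimal sketch of a batched variant (read_csv_batched is a hypothetical rename, yielding batches of one):

import numpy as np

def read_csv_batched(filename):
    with open(filename, 'r') as f:
        for line in f:
            record = line.rstrip().split(',')
            features = np.array([np.float32(n) for n in record[9:73]])
            label = int(record[1])
            # Add a leading batch axis: shapes become (1, 64) and (1,)
            yield features[np.newaxis, :], np.array([label])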

If you are using TF 2.0, you can verify whether the contents of the dataset are right. You can simply do this with:

print(next(iter(tf_ds)))

to see the first element of the dataset and check whether it matches the input expected by the model.
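
Along the same lines, a sketch of wrapping the read_csv generator from the question in a tf.data.Dataset (assuming TF 2.0), which also takes care of the batching that fit_generator was missing:

import tensorflow as tf

# Wrap the plain Python generator in a tf.data pipeline; TF then
# batches the samples, so the model receives inputs of shape (batch, 64)
tf_ds = tf.data.Dataset.from_generator(
    lambda: read_csv("mini_surf_kps.csv"),
    output_types=(tf.float32, tf.int32),
    output_shapes=((64,), ()),
).batch(200)

print(next(iter(tf_ds)))  # first element: a (features, labels) batch

model.fit(tf_ds, epochs=20)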

There's a great tutorial about this, visit:

Simply:

import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
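
Note that numpy.loadtxt reads the whole file into memory, so on the 60 GB file from the question it would have to be combined with bounded reads. A sketch using the max_rows parameter (available since NumPy 1.16) to load only the first million rows:

import numpy
filename = 'surf_kps.csv'
with open(filename, 'rt') as raw_data:
    # Read at most 10**6 rows instead of the whole file
    data = numpy.loadtxt(raw_data, delimiter=",", max_rows=10 ** 6)
print(data.shape)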
