
How to load a big dataset from CSV into Keras

I'm trying to use Keras with TensorFlow to train a network based on the SURF features that I obtained from several images. I have all these features stored in a CSV file that has the following columns:

 [ID, Code, PointX, PointY, Desc1, ..., Desc64]

The "ID" column is an auto-incremented index created by pandas when I store all the values. The "Code" column is the label of the point; it is just a number that I got by mapping the actual code (which is a string) to a number. "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point.

The CSV file contains all the KeyPoints and Descriptors found in all 20,000 images. This gives me a total size of almost 60 GB on disk, which I obviously can't fit into memory.

I've been trying to load the file in batches using pandas, put all the values in a numpy array, and then fit my model (a Sequential model with only 3 layers). I've used the following code to do so:

chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
    dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
    # Divide dataset in data and labels
    X = dataset_chunk[:,9:]
    Y = dataset_chunk[:,1]
    # Train model
    # Evaluate model
    scores = model.evaluate(X, Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

This works fine for the first chunk, but when the loop gets another chunk, accuracy and loss get stuck at 0.

Is the way I'm trying to load all this information wrong?

Thanks in advance!

------ EDIT ------

OK, now I made a simple generator like this:

def read_csv(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            record = line.rstrip().split(',')
            features = [np.float32(n) for n in record[9:73]]
            label = int(record[1])
            print("features: ",type(features[0]), " ", type(label))
            yield np.array(features), label

and use fit_generator with it:

tf_ds = read_csv("mini_surf_kps.csv")

I don't know why, but I keep getting an error just before the first epoch starts:

ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)

The first layer of the model has input_dim=64 and the shape of the features array yielded is also 64.

If you are using TF 2.0, you can verify that the contents of the dataset are right: look at the first element of the dataset and check that it matches the input expected by the model.
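As a rough sketch of that check: `fit_generator` expects every yielded item to be a whole batch, so a single 64-element array is most likely being read as 64 samples of one value each, which would match the `(1,)` in the error. The generator below yields `(batch, 64)` arrays instead and inspects the first one with `next()`. The name `batch_generator`, its parameters and the synthetic demo file are made up for illustration; the column positions are copied from the question and the file is assumed to have no header row.

```python
import csv
import os
import tempfile

import numpy as np


def batch_generator(filename, batch_size=32, label_col=1, desc_start=9, n_desc=64):
    """Yield (features, labels) batches from a header-less CSV, looping forever.

    Column positions default to the ones used in the question
    (label in column 1, descriptors in columns 9..72); adjust them to the file.
    """
    while True:  # Keras-style generators are expected to loop indefinitely
        features, labels = [], []
        with open(filename, newline="") as f:
            for record in csv.reader(f):  # streams the file row by row
                features.append([float(v) for v in record[desc_start:desc_start + n_desc]])
                labels.append(int(record[label_col]))
                if len(features) == batch_size:
                    yield np.asarray(features, dtype=np.float32), np.asarray(labels)
                    features, labels = [], []
        if features:  # leftover rows form a smaller final batch
            yield np.asarray(features, dtype=np.float32), np.asarray(labels)


# Demo on a tiny synthetic file: 5 rows, 73 columns each (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_kps.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        for i in range(5):
            writer.writerow([i, i % 2] + [0.5] * 71)

    gen = batch_generator(demo_path, batch_size=2)
    first_x, first_y = next(gen)  # look at the first element only
    gen.close()  # releases the open file before the directory is removed
    print(first_x.shape, first_y.shape)  # (2, 64) (2,)
```

If the first batch has shape `(batch_size, 64)`, it matches a first layer with `input_dim=64`. The generator would then be passed to `fit_generator` (or to `fit` in later TF 2.x releases) together with a `steps_per_epoch` value, since it never stops on its own.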

There's a great tutorial about this, visit:


   import numpy
   filename = 'pima-indians-diabetes.data.csv'
   # Note: loadtxt reads the whole file into memory at once
   with open(filename, 'rt') as raw_data:
       data = numpy.loadtxt(raw_data, delimiter=",")
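Since `numpy.loadtxt` on the whole file would not fit a 60 GB CSV into memory, here is a minimal sketch of the same call applied to one block of lines at a time. The helper `iter_chunks`, its parameters and the demo file are hypothetical, and the sketch assumes a purely numeric file with no header row and no blank lines.

```python
import itertools
import os
import tempfile

import numpy


def iter_chunks(filename, rows_per_chunk, delimiter=","):
    """Yield successive 2-D arrays of at most rows_per_chunk rows each."""
    with open(filename, "rt") as raw_data:
        while True:
            # islice pulls only the next block of lines from the file
            lines = list(itertools.islice(raw_data, rows_per_chunk))
            if not lines:
                break
            # ndmin=2 keeps a single-row chunk two-dimensional
            yield numpy.loadtxt(lines, delimiter=delimiter, ndmin=2)


# Demo on a tiny synthetic file with 5 rows and 3 columns (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w") as f:
        for i in range(5):
            f.write(f"{i},{i + 0.5},{i * 2}\n")

    chunk_shapes = [chunk.shape for chunk in iter_chunks(demo_path, rows_per_chunk=2)]
    print(chunk_shapes)  # [(2, 3), (2, 3), (1, 3)]
```

Only `rows_per_chunk` lines are held in memory at a time, so the chunk size can be tuned to the available RAM.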
