How to load a big dataset from CSV into Keras
I'm trying to use Keras with TensorFlow to train a network on the SURF features that I obtained from several images. I have all these features stored in a CSV file with the following columns:
[ID, Code, PointX, PointY, Desc1, ..., Desc64]
The "ID" column is an auto-incremented index created by pandas when I store all the values. The "Code" column is the label of the point; this is just a number that I got by pairing the actual code (which is a string) with a number. "PointX/Y" are the coordinates of the point found in an image of a given class, and "Desc#" is the float value of the corresponding descriptor of that point.
The CSV file contains all the KeyPoints and Descriptors found in all 20,000 images. This gives a total size of almost 60 GB on disk, which I obviously can't fit into memory.
I've been trying to load chunks of the file using pandas, put all the values in a numpy array, and then fit my model (a Sequential model of only 3 layers). I've used the following code to do so:
import numpy as np
import pandas as pd

chunksize = 10 ** 6
for chunk in pd.read_csv("surf_kps.csv", chunksize=chunksize):
    dataset_chunk = chunk.to_numpy(dtype=np.float32, copy=False)
    print(dataset_chunk)
    # Split the chunk into features and labels
    X = dataset_chunk[:, 9:]
    Y = dataset_chunk[:, 1]
    # Train model on this chunk
    model.fit(x=X, y=Y, batch_size=200, epochs=20)
    # Evaluate model on the same chunk
    scores = model.evaluate(X, Y)
    print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))
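For reference, the chunked read/split step above can be sketched on a toy in-memory CSV (`io.StringIO` stands in for the real surf_kps.csv, and the column positions are illustrative, not the question's exact layout):

```python
import io
import numpy as np
import pandas as pd

# Toy stand-in for surf_kps.csv: an ID, a label "Code", and 3 descriptors.
csv_text = (
    "ID,Code,Desc1,Desc2,Desc3\n"
    "0,1,0.1,0.2,0.3\n"
    "1,2,0.4,0.5,0.6\n"
    "2,1,0.7,0.8,0.9\n"
)

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    arr = chunk.to_numpy(dtype=np.float32, copy=False)
    X = arr[:, 2:]  # descriptor columns
    Y = arr[:, 1]   # "Code" label column
    print(X.shape, Y.shape)
```

Note that with `epochs=20` inside the loop, the model trains 20 full epochs on one chunk before ever seeing the next; a common alternative is to loop over epochs on the outside and over chunks on the inside.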
This works fine for the first chunk loaded, but when the loop gets the next chunk, accuracy and loss get stuck at 0.
Is the way I'm trying to load all this information wrong?
Thanks in advance!
------ EDIT ------
Ok, now I made a simple generator like this:
def read_csv(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            record = line.rstrip().split(',')
            features = [np.float32(n) for n in record[9:73]]
            label = int(record[1])
            print("features: ", type(features[0]), " ", type(label))
            yield np.array(features), label
and use fit_generator with it:
tf_ds = read_csv("mini_surf_kps.csv")
model.fit_generator(tf_ds, steps_per_epoch=1000, epochs=20)
I don't know why, but I keep getting an error just before the first epoch starts:
ValueError: Error when checking input: expected dense_input to have shape (64,) but got array with shape (1,)
The first layer of the model has input_dim=64, and the shape of the features array yielded is also 64.
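For reference, Keras's fit_generator expects each item the generator yields to be a whole batch (a batch of inputs plus a batch of targets), not a single sample, and yielding one sample at a time can produce exactly this kind of shape mismatch. A minimal batched rewrite of the generator above (batch_generator is a hypothetical name; the column indices are copied from the question):

```python
import numpy as np

def batch_generator(filename, batch_size=32):
    """Yield (features, labels) batches of shape (batch_size, 64) and
    (batch_size,); Keras generators are also expected to loop forever."""
    while True:
        with open(filename, 'r') as f:
            feats, labels = [], []
            for line in f:
                record = line.rstrip().split(',')
                feats.append([np.float32(n) for n in record[9:73]])
                labels.append(int(record[1]))
                if len(feats) == batch_size:
                    yield np.array(feats), np.array(labels)
                    feats, labels = [], []
```

A leftover partial batch at end-of-file is simply dropped here; whether to yield it instead is a design choice.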
I think it is better to use tf.data.Dataset; this may help:
If you are using TF 2.0, you can verify whether the contents of the dataset are right. You can simply do
print(next(iter(tf_ds)))
to see the first element of the dataset and check whether it matches the input expected by the model.
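A sketch of wrapping the question's line-by-line generator in a tf.data.Dataset via tf.data.Dataset.from_generator (csv_dataset is a hypothetical helper; output_shapes declares the (64,) feature vector so Keras no longer has to guess):

```python
import numpy as np
import tensorflow as tf

def csv_dataset(filename, batch_size=200):
    """Expose the CSV rows as a batched tf.data.Dataset of
    ((batch, 64) float32 features, (batch,) int32 labels)."""
    def gen():
        with open(filename, 'r') as f:
            for line in f:
                record = line.rstrip().split(',')
                yield (np.array([np.float32(n) for n in record[9:73]]),
                       int(record[1]))
    return tf.data.Dataset.from_generator(
        gen,
        output_types=(tf.float32, tf.int32),
        output_shapes=((64,), ()),
    ).batch(batch_size)

# Usage (assuming the model from the question):
# tf_ds = csv_dataset("mini_surf_kps.csv")
# model.fit(tf_ds, epochs=20)
```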
There's a great tutorial about this, visit: https://machinelearningmastery.com/load-machine-learning-data-python/
# Load CSV (using python)
import csv
import numpy

filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)
Or simply:
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
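Note that numpy.loadtxt reads the entire file into memory, so it only suits datasets that fit in RAM, unlike the 60 GB file in the question. A small sketch with an in-memory stand-in for the Pima CSV, also splitting features from the label column:

```python
import io
import numpy

# Two sample rows standing in for pima-indians-diabetes.data.csv
raw = io.StringIO("6,148,72,35,0,33.6,0.627,1\n"
                  "1,85,66,29,0,26.6,0.351,0\n")
data = numpy.loadtxt(raw, delimiter=",")
X, Y = data[:, :-1], data[:, -1]  # last column is the class label
print(data.shape)  # (2, 8)
```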