Tensorflow Reading CSV - What's the best approach
So I've been trying out different ways of reading a CSV file with 97K lines, each line with 500 features (about 100 MB).
My first approach was to read all the data into memory using numpy:
raw_data = genfromtxt(filename, dtype=numpy.int32, delimiter=',')
This command took so long to run that I needed to find a better way to read my file.
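As a side note (an assumption on my part, not something measured in the original post), `genfromtxt` is slow largely because of its per-field missing-value handling; for a clean, purely numeric CSV, `np.loadtxt` (or `pandas.read_csv`) usually reads the same data noticeably faster. A minimal sketch:

```python
import io
import numpy as np

# Small in-memory CSV standing in for the real 97K-line, 500-feature file.
csv_text = "\n".join(
    ",".join(str(i * 5 + j) for j in range(5)) for i in range(4)
)

# loadtxt assumes well-formed numeric input and skips genfromtxt's
# missing-value machinery, which is what makes it faster on clean data.
fast = np.loadtxt(io.StringIO(csv_text), dtype=np.int32, delimiter=",")
slow = np.genfromtxt(io.StringIO(csv_text), dtype=np.int32, delimiter=",")

assert (fast == slow).all()
print(fast.shape)
```

Both calls return the same array here; the difference only shows up as runtime on large files.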
The second approach was to follow this guideline: https://www.tensorflow.org/programmers_guide/reading_data
The first thing I noticed is that every epoch takes much longer to run. Since I'm using stochastic gradient descent, this can be explained by the fact that every batch needs to be read from the file.
Is there a way to optimize this second approach?
My code (2nd approach):
reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue) # read one line
data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
labels = data[0]
features = data[labelsSize:labelsSize+featuresSize]
# minimum number elements in the queue after a dequeue, used to ensure
# that the samples are sufficiently mixed
# I think 10 times the BATCH_SIZE is sufficient
min_after_dequeue = 10 * batch_size
# the maximum number of elements in the queue
capacity = 20 * batch_size
# shuffle the data to generate BATCH_SIZE sample pairs
features_batch, labels_batch = tf.train.shuffle_batch([features, labels], batch_size=batch_size, num_threads=10, capacity=capacity, min_after_dequeue=min_after_dequeue)
...
coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)
try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()
        # Run one step of the model. The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op. To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run([eval_op_train, train_op, loss_op, summary_op])
except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()
A solution can be to convert the data into the TensorFlow binary format using TFRecords.

See TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues, and to convert the CSV file to TFRecords look at this snippet:
import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
To stream (very) large files from the local file system, or, in a more realistic use case, from remote storage like AWS S3, HDFS, etc., the Gensim smart_open Python library could be helpful:
import smart_open

# stream lines from an S3 object
for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print line