
Tensorflow Reading CSV - What's the best approach

So I've been trying out different ways of reading a CSV file with 97K lines, each line with 500 features (about 100 MB).

My first approach was to read all the data into memory using numpy:

raw_data = genfromtxt(filename, dtype=numpy.int32, delimiter=',')

This command took so long to run that I needed to find a better way to read my file.
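For comparison, pandas' C-based parser usually reads a large all-numeric CSV far faster than genfromtxt. A minimal sketch, using a tiny generated file as a stand-in for the real one:

```python
import numpy as np
import pandas as pd

# tiny stand-in for the real 97K x 500 file
with open("sample.csv", "w") as f:
    f.write("1,2,3\n4,5,6\n")

# pandas' C parser is typically much faster than numpy.genfromtxt for
# large numeric CSV files; header=None because there is no header row
raw_data = pd.read_csv("sample.csv", header=None, dtype=np.int32).values

print(raw_data.shape)  # (2, 3)
```

The result is an ordinary int32 ndarray, so it drops into the rest of the code unchanged.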

The second approach was to follow this guide: https://www.tensorflow.org/programmers_guide/reading_data

The first thing I noticed is that every epoch takes much longer to run. Since I'm using stochastic gradient descent, this can be explained by the fact that every batch needs to be read from the file.

Is there a way to optimize this second approach?

My code (2nd approach):

reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue) # read one line
data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)

labels = data[0]
features = data[labelsSize:labelsSize+featuresSize]

# minimum number elements in the queue after a dequeue, used to ensure 
# that the samples are sufficiently mixed
# I think 10 times the BATCH_SIZE is sufficient
min_after_dequeue = 10 * batch_size

# the maximum number of elements in the queue
capacity = 20 * batch_size

# shuffle the data to generate BATCH_SIZE sample pairs
features_batch, labels_batch = tf.train.shuffle_batch([features, labels], batch_size=batch_size, num_threads=10, capacity=capacity, min_after_dequeue=min_after_dequeue)

And later, the training loop:

coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()

        # Run one step of the model.  The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op.  To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run([eval_op_train, train_op, loss_op, summary_op])

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()

One solution is to convert the data to TensorFlow's binary format using TFRecords.
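Part of the win is that fixed-width binary records skip text parsing entirely. A rough stdlib-only illustration of the per-row decoding cost (the 500-int row mirrors the file's 500 features; the exact timings are machine-dependent):

```python
import struct
import timeit

# one row of 500 int32 features, once as CSV text and once as raw binary
row_text = ",".join(str(i) for i in range(500))
row_bin = struct.pack("500i", *range(500))

decode_text = lambda: [int(x) for x in row_text.split(",")]
decode_bin = lambda: struct.unpack("500i", row_bin)

t_text = timeit.timeit(decode_text, number=1000)
t_bin = timeit.timeit(decode_bin, number=1000)
# decoding the fixed-width binary row is typically an order of
# magnitude faster than splitting and int-converting the text row
```

Multiplied by 97K rows per epoch, that parsing cost is a large part of why the text-based pipeline feels slow.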

See TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues

and to convert the CSV file to TFRecords, look at this snippet:

import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
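To sanity-check what each serialized record contains, a tf.train.Example can be round-tripped in plain Python, without a session (the feature values below are made up):

```python
import tensorflow as tf

features, label = [1.5, 2.5, 3.5], 7  # stand-ins for one CSV row

example = tf.train.Example()
example.features.feature["features"].float_list.value.extend(features)
example.features.feature["label"].int64_list.value.append(label)
payload = example.SerializeToString()  # the bytes written to the TFRecord file

# parse the serialized protobuf back and check the values survived
parsed = tf.train.Example.FromString(payload)
assert list(parsed.features.feature["features"].float_list.value) == features
assert parsed.features.feature["label"].int64_list.value[0] == label
```

(The values 1.5, 2.5, 3.5 are exactly representable in float32, so the equality check holds; arbitrary floats would only match approximately after the float32 round-trip.)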

To stream (very) large files from the local file system or, in a more realistic use case, from remote storage such as AWS S3 or HDFS, the Gensim smart_open Python library can be helpful:

    import smart_open

    # stream lines from an S3 object
    for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
        print(line)

