
TensorFlow Reading CSV - What's the best approach?

So I've been trying out different ways of reading a CSV file with 97K lines, each line having 500 features (about 100 MB).

My first approach was to read all the data into memory using NumPy:

import numpy
raw_data = numpy.genfromtxt(filename, dtype=numpy.int32, delimiter=',')

This command took so long to run that I needed to find a better way to read my file.
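As an aside, even the in-memory approach can usually be made much faster: pandas.read_csv (which the answer below also uses) is typically far quicker than numpy.genfromtxt on a purely numeric file of this size. A minimal sketch, assuming no header row and all-integer columns, could be:

import numpy as np
import pandas as pd

# read the whole ~100 MB file in one go with pandas' C parser
# (assumes no header row and purely numeric, integer-valued columns)
raw_data = pd.read_csv(filename, header=None, dtype=np.int32).values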

The second approach was to follow this guide: https://www.tensorflow.org/programmers_guide/reading_data

The first thing I noticed is that every epoch takes much longer to run. Since I'm using stochastic gradient descent, this makes sense: every batch needs to be read from the file.

Is there a way to optimize this second approach?

My code (2nd approach):

reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue) # read one line
data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)

labels = data[0]
features = data[labelsSize:labelsSize+featuresSize]

# minimum number of elements in the queue after a dequeue, used to ensure
# that the samples are sufficiently mixed
# I think 10 times the BATCH_SIZE is sufficient
min_after_dequeue = 10 * batch_size

# the maximum number of elements in the queue
capacity = 20 * batch_size

# shuffle the data to generate BATCH_SIZE sample pairs
features_batch, labels_batch = tf.train.shuffle_batch(
    [features, labels],
    batch_size=batch_size,
    num_threads=10,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)
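The snippet assumes rDefaults, labelsSize, featuresSize and batch_size are defined earlier. A minimal, hypothetical setup, assuming one integer label followed by 500 integer features per row, could be:

batch_size = 100     # hypothetical value, tune to memory/throughput
labelsSize = 1       # one label column at the start of each row
featuresSize = 500   # 500 feature columns per row

# one default per CSV column, used by decode_csv when a field is missing;
# integer zeros here, which also fixes the column dtype as int32
rDefaults = [[0] for _ in range(labelsSize + featuresSize)]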

* * * *

coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()

        # Run one step of the model.  The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op.  To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run([eval_op_train, train_op, loss_op, summary_op])

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()

A solution is to convert the data into TensorFlow's binary format using TFRecords.

See TensorFlow Data Input (Part 1): Placeholders, Protobufs & Queues

To convert the CSV file to TFRecords, look at this snippet:

import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        # assumes the label is the last column and everything else is a feature
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
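The snippet above only writes the file; to benefit from the conversion, the input pipeline then reads serialized Examples back instead of decoding CSV lines. Purely as a sketch under the same assumptions (keys "features" and "label" as above, 500 float features per row, the same queue-based TF 1.x API as in the question), the reading side might look like this:

import tensorflow as tf

batch_size = 100  # hypothetical batch size, tune as needed

# queue of input files, then a reader that yields one serialized Example at a time
filename_queue = tf.train.string_input_producer(["csv.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# parse the serialized Example back into dense tensors
parsed = tf.parse_single_example(
    serialized_example,
    features={
        "features": tf.FixedLenFeature([500], tf.float32),  # assumes 500 features per row
        "label": tf.FixedLenFeature([], tf.int64),
    })

# batch and shuffle exactly as with the CSV pipeline
features_batch, labels_batch = tf.train.shuffle_batch(
    [parsed["features"], parsed["label"]],
    batch_size=batch_size,
    num_threads=10,
    capacity=20 * batch_size,
    min_after_dequeue=10 * batch_size)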

To stream (very) large files from the local file system, or, in a more realistic use case, from remote storage such as AWS S3 or HDFS, the Gensim smart_open Python library can be helpful:

    import smart_open

    # stream lines from an S3 object
    for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
        print(line)
