
Tensorflow - input from DataFrame using batching and sparse/categorical data

I have a relatively small dataset that I load into memory using a pandas DataFrame. I'd like to feed this data to a tensorflow model in batches while maintaining support for sparse (categorical) columns, and I'd like to avoid serializing my data to disk in some other format. Although this doesn't seem too complicated, I couldn't find a good example in the docs and had a pretty tough time designing a suitable input_fn myself.

A toy example dataset would be:

df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)

>>> df
   c0  c1  c2
0   3  3g   1
1   1  1g   0
2   1  2g   0
3   2  2g   1
4   2  3g   0
5   1  3g   0
6   3  1g   0

where c0 is a dense numeric column, c1 is a categorical column, and c2 is a binary label column.

My solution is below; anything prettier and/or more efficient would be great.

Here's my (OP) solution. The conversion and serialization step is pretty slow (about 3 seconds per 1000 samples). Anything more efficient would be greatly appreciated.

import tensorflow as tf

######################################
# Define Feature conversion functions
######################################
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))

def bytes_feature(value):
    # Encode to bytes so this also works under Python 3, where str is not bytes
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))
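As a quick sanity check, each converter wraps a single Python value in the corresponding tf.train.Feature proto; printing one shows the proto in text format, e.g.:

>>> int64_feature(3)
int64_list {
  value: 3
}

>>> bytes_feature('3g')
bytes_list {
  value: "3g"
}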

####################################################
# Define tensorflow data feed from pandas DataFrame
####################################################
def input_fn(df, label_col_name, int_col_names, float_col_names, cat_col_names, num_epochs, batch_size, shuffle=False):
    # Define new column groups
    feature_col_names = int_col_names + float_col_names + cat_col_names
    all_col_names = [label_col_name] + feature_col_names

    # Create conversion and parser dicts
    converters = {}
    parse_dict = {}
    for col in all_col_names:
        if col in cat_col_names:
            converters[col] = bytes_feature
            parse_dict[col] = tf.VarLenFeature(tf.string)
        elif col in float_col_names:
            converters[col] = float_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.float32)
        elif col in int_col_names + [label_col_name]:
            converters[col] = int64_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.int64)

    # Convert DataFrame rows to feature Examples, serialize examples to string
    serialized_examples = []
    for record in df[all_col_names].to_dict('records'):
        feat_record = {k: converters[k](v) for k, v in record.items()}
        example = tf.train.Example(features=tf.train.Features(feature=feat_record))
        serialized_examples.append(example.SerializeToString())

    # Create input queue
    example_queue = tf.train.slice_input_producer([serialized_examples], num_epochs=num_epochs, shuffle=shuffle)

    # Create batch
    example_batch = tf.train.batch(example_queue, batch_size=batch_size, capacity=30, allow_smaller_final_batch=True)

    # Parse batch
    parsed_example_batch = tf.parse_example(example_batch, parse_dict)

    # Split into features and label
    feature_batch = {k: parsed_example_batch[k] for k in feature_col_names}
    label_batch = parsed_example_batch[label_col_name]

    return feature_batch, label_batch

Example usage:

import functools
import numpy as np
import pandas as pd

# Create toy dataset
df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)

# Specify feature names
cat_feats = ['c1']
float_feats = []
int_feats = ['c0']
label_feat = 'c2'

# Create parameterless input function
epochs = 3
batch_size = 2
input_fn_train = functools.partial(input_fn, df, label_feat, int_feats, float_feats, cat_feats, epochs, batch_size)

# Define features
continuous_features = [tf.contrib.layers.real_valued_column(feat) for feat in float_feats+int_feats]
categorical_features = [tf.contrib.layers.sparse_column_with_hash_bucket(feat, hash_bucket_size=1000) for feat in cat_feats]
features = continuous_features + categorical_features

# Create and fit model
model = tf.contrib.learn.LinearClassifier(feature_columns=features)
model.fit(input_fn=input_fn_train, steps=1000)
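To sanity-check the fit, the same machinery can be reused for evaluation (a sketch; a real pipeline would build the eval input_fn from a held-out DataFrame rather than the training df):

# Single epoch, no shuffling, for evaluation
input_fn_eval = functools.partial(input_fn, df, label_feat, int_feats, float_feats, cat_feats, 1, batch_size)
results = model.evaluate(input_fn=input_fn_eval, steps=1)
print(results)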

Firstly, why are you serializing these to tf.Example protos and then deserializing them with parse_example? Serializing, batching, and then deserializing does unnecessary work. The standard way to define input functions in tensorflow is with tf.data. For tf.data + tf.Estimator, this documentation might be helpful. And in this case specifically, this code should work:

def input_fn(df, label_feat, num_epochs, batch_size, shuffle=False):
  # Each element of dataset is one row of the dataframe
  dataset = tf.data.Dataset.from_tensor_slices(dict(df))

  def map_fn(element, label_feat):
    # element is a {'c0': ..., 'c1': ..., 'c2': ...} dict of tensors;
    # since map is applied after batch, each value has shape [batch_size]
    label = element.pop(label_feat)
    return (element, label)

  if shuffle:
    # Buffer the whole (small, in-memory) dataset for a full shuffle
    dataset = dataset.shuffle(buffer_size=len(df))

  # Batch the elements of the dataset
  dataset = dataset.batch(batch_size)
  # Repeat the dataset for num_epochs
  dataset = dataset.repeat(num_epochs)

  # Split it into a (features, label) tuple
  dataset = dataset.map(lambda elem: map_fn(elem, label_feat))

  # One shot iterator iterates through the (repeated) dataset once, 
  # yielding feature_batch, label_batch
  iterator = dataset.make_one_shot_iterator()
  feature_batch, label_batch = iterator.get_next()
  return feature_batch, label_batch
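This input_fn drops in where the original one was used; for example, reusing the feature columns and the functools.partial pattern from the question (a sketch under those assumptions):

input_fn_train = functools.partial(input_fn, df, label_feat, num_epochs=3, batch_size=2)
model = tf.contrib.learn.LinearClassifier(feature_columns=features)
model.fit(input_fn=input_fn_train, steps=1000)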

Additionally, based on your code, it seems like there might be some confusion about SparseTensors vs. sparse_column. When you use tf.VarLenFeature, the feature is parsed as a SparseTensor, which is only necessary when the feature has variable shape. In this case, your c1 features are all scalar string tensors, so FixedLenFeature works fine: there is no need to represent the feature as a SparseTensor, even if it eventually gets fed to a sparse_column. This documentation tells you more about sparse columns.
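Concretely, the suggested change to the original parse_dict would look like this (a minimal sketch):

# Parse c1 as a dense scalar string instead of a SparseTensor
parse_dict[col] = tf.FixedLenFeature([], tf.string)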
