I have a relatively small dataset that I load into memory using a pandas DataFrame. I'd like to feed this data to a TensorFlow model using batching, while maintaining support for sparse (categorical) columns. I'd also like to avoid having to serialize my data to disk in some other format. Although this doesn't seem too complicated, I couldn't find a good example in the docs and had a pretty tough time designing a suitable input_fn myself.
A toy example dataset would be:
df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)
>>> df
   c0  c1  c2
0   3  3g   1
1   1  1g   0
2   1  2g   0
3   2  2g   1
4   2  3g   0
5   1  3g   0
6   3  1g   0
where c0 is a dense numeric column, c1 is a categorical column, and c2 is a binary label column.
My solution is below; anything prettier and/or more efficient would be great.
Here's my (OP) solution. The conversion and serialization step is pretty slow (about 3 seconds per 1000 samples). Anything more efficient would be greatly appreciated.
import tensorflow as tf

######################################
# Define Feature conversion functions
######################################
def int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[int(value)]))

def float_feature(value):
    return tf.train.Feature(float_list=tf.train.FloatList(value=[float(value)]))

def bytes_feature(value):
    # BytesList needs bytes, so encode the string form (works on Python 2 and 3)
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))
####################################################
# Define tensorflow data feed from pandas DataFrame
####################################################
def input_fn(df, label_col_name, int_col_names, float_col_names, cat_col_names, num_epochs, batch_size, shuffle=False):
    # Define new column groups
    feature_col_names = int_col_names + float_col_names + cat_col_names
    all_col_names = [label_col_name] + feature_col_names

    # Create conversion and parser dicts
    converters = {}
    parse_dict = {}
    for col in all_col_names:
        if col in cat_col_names:
            converters[col] = bytes_feature
            parse_dict[col] = tf.VarLenFeature(tf.string)
        elif col in float_col_names:
            converters[col] = float_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.float32)
        elif col in int_col_names + [label_col_name]:
            converters[col] = int64_feature
            parse_dict[col] = tf.FixedLenFeature([], tf.int64)

    # Convert DataFrame rows to feature Examples, serialize examples to string
    serialized_examples = []
    for record in df[all_col_names].to_dict('records'):
        # .items() works on both Python 2 and 3 (iteritems() is Python 2 only)
        feat_record = {k: converters[k](v) for k, v in record.items()}
        example = tf.train.Example(features=tf.train.Features(feature=feat_record))
        serialized_examples.append(example.SerializeToString())

    # Create input queue
    example_queue = tf.train.slice_input_producer([serialized_examples], num_epochs=num_epochs, shuffle=shuffle)

    # Create batch
    example_batch = tf.train.batch(example_queue, batch_size=batch_size, capacity=30, allow_smaller_final_batch=True)

    # Parse batch
    parsed_example_batch = tf.parse_example(example_batch, parse_dict)

    # Split into features and label
    feature_batch = {k: parsed_example_batch[k] for k in feature_col_names}
    label_batch = parsed_example_batch[label_col_name]

    return feature_batch, label_batch
Example usage:
import functools
import numpy as np
import pandas as pd
# Create toy dataset
df = pd.DataFrame(np.random.randint(1, 4, [7, 3]), columns=['c0', 'c1', 'c2'])
df['c1'] = df['c1'].astype(str) + 'g'
df['c2'] = (df['c2'] > 2.5).astype(int)
# Specify feature names
cat_feats = ['c1']
float_feats = []
int_feats = ['c0']
label_feat = 'c2'
# Create parameterless input function
epochs = 3
batch_size = 2
input_fn_train = functools.partial(input_fn, df, label_feat, int_feats, float_feats, cat_feats, epochs, batch_size)
# Define features
continuous_features = [tf.contrib.layers.real_valued_column(feat) for feat in float_feats+int_feats]
categorical_features = [tf.contrib.layers.sparse_column_with_hash_bucket(feat, hash_bucket_size=1000) for feat in cat_feats]
features = continuous_features + categorical_features
# Create and fit model
model = tf.contrib.learn.LinearClassifier(feature_columns=features)
model.fit(input_fn=input_fn_train, steps=1000)
First, why are you serializing these to tf.Examples and then deserializing them with parse_example? Serializing them, then batching them, then deserializing them does unnecessary work. The standard way to define input functions in TensorFlow is with tf.data. For tf.data + tf.Estimators, this documentation might be helpful. In this case specifically, this code should work:
def input_fn(df, label_feat, num_epochs, batch_size, shuffle=False):
    # Each element of dataset is one row of the dataframe
    dataset = tf.data.Dataset.from_tensor_slices(dict(df))

    def map_fn(element, label_feat):
        # element is a {'c0': int, 'c1': str, 'c2': int} dictionary (batched by this point)
        label = element.pop(label_feat)
        return (element, label)

    if shuffle:
        # Assumption: the data fits in memory, so the full row count
        # is used as the shuffle buffer size
        dataset = dataset.shuffle(buffer_size=len(df))

    # Batch the elements of the dataset
    dataset = dataset.batch(batch_size)

    # Repeat the dataset for num_epochs
    dataset = dataset.repeat(num_epochs)

    # Split it into features, label tuple
    dataset = dataset.map(lambda elem: map_fn(elem, label_feat))

    # One shot iterator iterates through the (repeated) dataset once,
    # yielding feature_batch, label_batch
    iterator = dataset.make_one_shot_iterator()
    feature_batch, label_batch = iterator.get_next()
    return feature_batch, label_batch
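To see what that pipeline is expected to yield without running TensorFlow, here is a rough pure-Python sketch of the same shuffle, batch, repeat, and split-off-the-label steps. It is only an illustration, not how tf.data is implemented: iterate_batches and its seed parameter are hypothetical names, and the columns dict is a hard-coded copy of the toy dataset from the question.

```python
import random

def iterate_batches(columns, label_name, num_epochs, batch_size, shuffle=False, seed=None):
    """Yield (features, labels) batches from a dict of equal-length column lists."""
    num_rows = len(columns[label_name])
    rng = random.Random(seed)
    for _ in range(num_epochs):
        # Reshuffle the row order at the start of every epoch
        order = list(range(num_rows))
        if shuffle:
            rng.shuffle(order)
        # Batch within the epoch, so the last batch of each epoch may be smaller
        for start in range(0, num_rows, batch_size):
            rows = order[start:start + batch_size]
            batch = {name: [values[i] for i in rows] for name, values in columns.items()}
            # Split the label column off from the feature columns
            labels = batch.pop(label_name)
            yield batch, labels

# Hard-coded copy of the toy dataset (assumption: same values as the printed df)
columns = {
    'c0': [3, 1, 1, 2, 2, 1, 3],
    'c1': ['3g', '1g', '2g', '2g', '3g', '3g', '1g'],
    'c2': [1, 0, 0, 1, 0, 0, 0],
}
batches = list(iterate_batches(columns, 'c2', num_epochs=3, batch_size=2))
```

With 7 rows and a batch size of 2, each epoch produces three full batches and one batch of a single row, because batching happens before repeating. With a real DataFrame, df.to_dict('list') would give a columns dict of this shape.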
Additionally, based on your code, there seems to be some confusion between SparseTensors and sparse_column. When you use tf.VarLenFeature, the feature is parsed as a SparseTensor, which is only necessary when the feature being parsed has a variable shape. In this case, your c1 features are all scalar string tensors, so FixedLenFeature should work. There is no need for the feature to be represented as a sparse tensor, even if it eventually gets represented as a sparse_column. This documentation tells you more about sparse columns.
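The difference between the two parsing specs can be checked directly. The following is a minimal sketch, assuming TensorFlow 1.13 or later, where these names are also available under tf.io; make_example is a hypothetical helper, and the three c1 values are taken from the toy dataset.

```python
import tensorflow as tf

def make_example(c1_value):
    # Hypothetical helper: one serialized tf.train.Example with a single scalar string feature
    feature = tf.train.Feature(bytes_list=tf.train.BytesList(value=[c1_value.encode('utf-8')]))
    example = tf.train.Example(features=tf.train.Features(feature={'c1': feature}))
    return example.SerializeToString()

serialized = [make_example(v) for v in ['3g', '1g', '2g']]

# FixedLenFeature with shape [] parses each example's c1 as a scalar,
# so the batch comes back as an ordinary dense string tensor
dense = tf.io.parse_example(serialized, {'c1': tf.io.FixedLenFeature([], tf.string)})

# VarLenFeature parses the same data into a SparseTensor
sparse = tf.io.parse_example(serialized, {'c1': tf.io.VarLenFeature(tf.string)})

print(type(dense['c1']), type(sparse['c1']))
```

Either result can be passed on to a categorical feature column; the dense form simply avoids the sparse representation when every row has exactly one value.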