
Apply feature columns without tf.Estimator (TensorFlow 2.0.0-rc0)

The TensorFlow tf.Estimator and tf.feature_column docs explain well how to use feature columns together with an Estimator, e.g. to one-hot encode the categorical features in the dataset being used.

However, I want to "apply" my feature columns directly to a tf.data dataset that I create from a .csv file (with two columns: UserID, MovieID), without even defining a model or an Estimator. (Reason: I want to check exactly what is happening in my data pipeline, i.e. I'd like to be able to run a batch of samples through the pipeline and then see in the output how the features got encoded.)

This is what I have tried so far:

column_names = ['UserID', 'MovieID']

user_col = tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000)
movie_col = tf.feature_column.categorical_column_with_hash_bucket(key='MovieID', hash_bucket_size=1000)
feature_columns = [tf.feature_column.indicator_column(user_col), tf.feature_column.indicator_column(movie_col)]

feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_columns)

def process_csv(line):
  # parse one ';'-separated CSV line into two int32 tensors
  fields = tf.io.decode_csv(line, record_defaults=[tf.constant([], dtype=tf.int32)]*2, field_delim=";")
  features = dict(zip(column_names, fields))
  return features

ds = tf.data.TextLineDataset(csv_filepath)
ds = ds.map(process_csv, num_parallel_calls=4)
ds = ds.batch(10)
ds.map(lambda x: feature_layer(x))

However, the last line with the map call raises the following error:

ValueError: Column dtype and SparseTensors dtype must be compatible. key: MovieID, column dtype: <dtype: 'string'>, tensor dtype: <dtype: 'int32'>

I'm not sure what this error means... I also tried defining a tf.keras model containing only the feature_layer defined above, and then running .predict() on my dataset instead of using ds.map(lambda x: feature_layer(x)):

model = tf.keras.Sequential([feature_layer])
model.compile()
model.predict(ds)

However, this results in exactly the same error as above. Does anybody have an idea what is going wrong? Is there perhaps an easier way to achieve this?

Just found the issue: tf.feature_column.categorical_column_with_hash_bucket() takes an optional dtype argument, which defaults to tf.dtypes.string. However, the data type of my columns is numerical (tf.dtypes.int32). This solved the issue:

tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000, dtype=tf.dtypes.int32)
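
For reference, here is a minimal sketch of the corrected end-to-end pipeline (assuming the same two-column, ';'-separated CSV as above; csv_filepath stands in for the actual file path as in the question), which lets you inspect how one batch of features gets encoded:

import tensorflow as tf

column_names = ['UserID', 'MovieID']

# declare both hash-bucket columns with the int32 dtype the CSV actually contains
feature_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            key=name, hash_bucket_size=1000, dtype=tf.dtypes.int32))
    for name in column_names
]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_columns)

def process_csv(line):
  # parse one ';'-separated CSV line into two int32 tensors
  fields = tf.io.decode_csv(
      line, record_defaults=[tf.constant([], dtype=tf.int32)] * 2, field_delim=";")
  return dict(zip(column_names, fields))

ds = tf.data.TextLineDataset(csv_filepath)
ds = ds.map(process_csv, num_parallel_calls=4)
ds = ds.batch(10)
encoded = ds.map(lambda x: feature_layer(x))  # capture the result: map returns a new dataset

# inspect one batch: two indicator columns of width 1000 each -> shape (batch_size, 2000)
for batch in encoded.take(1):
  print(batch.shape)

Note also that Dataset.map returns a new dataset rather than modifying ds in place, so its result has to be assigned (in the original snippet the mapped dataset was discarded).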
