
Apply feature columns without tf.Estimator (TensorFlow 2.0.0-rc0)

The TensorFlow tf.Estimator and tf.feature_column docs explain well how to use feature columns together with an Estimator, e.g. to one-hot encode the categorical features in the dataset being used.

However, I want to "apply" my feature columns directly to a tf.data dataset that I create from a .csv file (with two columns: UserID, MovieID), without even defining a model or an Estimator. (Reason: I want to check exactly what is happening in my data pipeline, i.e. I'd like to be able to run a batch of samples through the pipeline and then see in the output how the features got encoded.)

This is what I have tried so far:

column_names = ['UserID', 'MovieID']

user_col = tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000)
movie_col = tf.feature_column.categorical_column_with_hash_bucket(key='MovieID', hash_bucket_size=1000)
feature_columns = [tf.feature_column.indicator_column(user_col), tf.feature_column.indicator_column(movie_col)]

feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_columns)

def process_csv(line):
  # parse one ';'-separated CSV line into two int32 tensors
  fields = tf.io.decode_csv(line, record_defaults=[tf.constant([], dtype=tf.int32)]*2, field_delim=";")
  features = dict(zip(column_names, fields))
  return features

ds = tf.data.TextLineDataset(csv_filepath)
ds = ds.map(process_csv, num_parallel_calls=4)
ds = ds.batch(10)
ds.map(lambda x: feature_layer(x))

However, the last line with the map call raises the following error:

ValueError: Column dtype and SparseTensors dtype must be compatible. key: MovieID, column dtype: <dtype: 'string'>, tensor dtype: <dtype: 'int32'>

I'm not sure what this error means... I also tried defining a tf.keras model containing only the feature_layer defined above, and then running .predict() on my dataset instead of using ds.map(lambda x: feature_layer(x)):

model = tf.keras.Sequential([feature_layer])
model.compile()
model.predict(ds)

However, this results in exactly the same error as above. Does anybody have an idea what is going wrong? Is there perhaps an easier way to achieve this?

Just found the issue: tf.feature_column.categorical_column_with_hash_bucket() takes an optional dtype argument, which defaults to tf.dtypes.string. However, the data type of my columns is numerical (tf.dtypes.int32). This solved the issue:

tf.feature_column.categorical_column_with_hash_bucket(key='UserID', hash_bucket_size=1000, dtype=tf.dtypes.int32)
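
For reference, here is a minimal sketch of the corrected end-to-end pipeline (assuming the same two-column, ';'-separated CSV as above; csv_filepath stands in for the actual file path as in the question), which lets you inspect how one batch of features gets encoded:

import tensorflow as tf

column_names = ['UserID', 'MovieID']

# declare both hash-bucket columns with the int32 dtype the CSV actually contains
feature_columns = [
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_hash_bucket(
            key=name, hash_bucket_size=1000, dtype=tf.dtypes.int32))
    for name in column_names
]
feature_layer = tf.keras.layers.DenseFeatures(feature_columns=feature_columns)

def process_csv(line):
  # parse one ';'-separated CSV line into two int32 tensors
  fields = tf.io.decode_csv(
      line, record_defaults=[tf.constant([], dtype=tf.int32)] * 2, field_delim=";")
  return dict(zip(column_names, fields))

ds = tf.data.TextLineDataset(csv_filepath)
ds = ds.map(process_csv, num_parallel_calls=4)
ds = ds.batch(10)
encoded = ds.map(lambda x: feature_layer(x))  # capture the result: map returns a new dataset

# inspect one batch: two indicator columns of width 1000 each -> shape (batch_size, 2000)
for batch in encoded.take(1):
  print(batch.shape)

Note also that Dataset.map returns a new dataset rather than modifying ds in place, so its result has to be assigned (in the original snippet the mapped dataset was discarded).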
