How to do custom pre-processing on data when using tf.data?
I need some help with tf.data.
I am running a few experiments on the SQuAD dataset. The dataset structure is as follows:
[row-1] { "context": "some big string", "question": "q string", "answer": "some ans" }
I would like to use tf.data for loading and pre-processing. After loading, each row is in the following format:
{
  context: Tensor("some big string"),
  question: Tensor("q string"),
  answer: Tensor("some ans")
}
Now I want to pre-process the data. Pre-processing is not straightforward here because the values are Tensor objects.
TensorFlow provides some APIs for this kind of pre-processing, but what if I want to do my own custom pre-processing, or use spaCy, which operates only on raw data types such as strings and not on tensors?
Basically, I want help with this snippet:
def format_data(row):
    # I can access an individual data row here, but its values are Tensors.
    # Hence I can't use my custom function. How can I use a custom function or a
    # spaCy function that operates on strings and not on tensors?
    # I can only use tf functions like the one below.
    return tf.strings.regex_replace(row['context'], 'some-regex', ' ', True)

train = dataset.map(format_data).batch(2)
list(train.take(1))
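For reference, the TensorFlow-only route from the snippet above can be tried on toy data. This is just a minimal sketch: the `rows` dictionary is made-up data standing in for SQuAD records, and the whitespace regex is an arbitrary replacement for 'some-regex'.

```python
import tensorflow as tf

# Hypothetical toy rows standing in for SQuAD records (not real data).
rows = {
    "context": ["some   big string", "another    context"],
    "question": ["q string", "q2 string"],
    "answer": ["some ans", "other ans"],
}
dataset = tf.data.Dataset.from_tensor_slices(rows)

def format_data(row):
    # Collapse runs of whitespace with a TensorFlow string op;
    # this runs inside the tf.data graph, so only tf ops are allowed here.
    return tf.strings.regex_replace(row["context"], r"\s+", " ", replace_global=True)

train = dataset.map(format_data).batch(2)
batch = next(iter(train))

# String tensors come back as bytes, so decode them for display.
cleaned = [s.decode("utf-8") for s in batch.numpy()]
print(cleaned)
```

This works as long as the transformation can be expressed with tf.strings ops. Anything that needs a plain Python string is not possible this way.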
The following code worked:
def parse_str(str_tensor):
    # Inside tf.py_function the tensor is eager, so .numpy() returns bytes.
    raw_string = str_tensor.numpy().decode("utf-8")
    # play with the raw string
    raw_string = 'AAA' + raw_string
    return raw_string
Call the parse function:
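def tf_pre_processing(row):
    # Wrap the Python function so it can be used inside Dataset.map.
    return tf.py_function(parse_str, [row['context']], [tf.string])

# t is the tf.data.Dataset loaded earlier
train = t.map(tf_pre_processing).batch(1).take(1)
list(train)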
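Putting the pieces together, here is a minimal end-to-end sketch of the tf.py_function approach. The assumptions are mine: the `rows` data is made up, and `clean_text` is a hypothetical stand-in for whatever plain-Python step you need (a spaCy pipeline would be called in the same place). Unlike the code above, it keeps the other fields of the row and restores the scalar shape that py_function drops.

```python
import tensorflow as tf

def clean_text(text: str) -> str:
    # Hypothetical plain-Python step; a spaCy call could go here instead.
    return " ".join(text.lower().split())

def parse_str(str_tensor):
    # Inside tf.py_function the argument is an eager tensor, so .numpy() works.
    raw_string = str_tensor.numpy().decode("utf-8")
    return clean_text(raw_string)

def tf_pre_processing(row):
    context = tf.py_function(parse_str, [row["context"]], tf.string)
    # py_function loses static shape information; declare the scalar shape again.
    context.set_shape([])
    # Keep question and answer, replace only the context.
    return {**row, "context": context}

# Hypothetical toy rows standing in for SQuAD records (not real data).
rows = {
    "context": ["Some   BIG string", "Another    Context"],
    "question": ["q string", "q2 string"],
    "answer": ["some ans", "other ans"],
}
dataset = tf.data.Dataset.from_tensor_slices(rows)
train = dataset.map(tf_pre_processing).batch(2)

batch = next(iter(train))
contexts = [s.decode("utf-8") for s in batch["context"].numpy()]
print(contexts)
```

Keep in mind that the wrapped function runs as ordinary Python, so it will generally be slower than native tf ops and it cannot be exported as part of a saved graph.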