What is the difference between giving a string and a list of string(s) to the Keras tokenizer?
I am working with keras.preprocessing to tokenize sentences, and I ran into an unexpected case with keras.preprocessing.text.Tokenizer. When I give it a string, word_index is a dictionary of single characters and their indexes, but when I give it a list, word_index is a dictionary of words (split on spaces).
Why does this happen?
String as tokenizer input:
from keras.preprocessing.text import Tokenizer
text = "Keras is a deep learning and neural networks API by François Chollet"
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text) #input of tokenizer as string
print(tokenizer.word_index)
>>> {'e': 1, 'a': 2, 'n': 3, 'r': 4, 's': 5, 'i': 6, 'l': 7, 'o': 8, 'k': 9, 'd': 10, 'p': 11, 't': 12, 'g': 13,
'u': 14, 'w': 15, 'b': 16, 'y': 17, 'f': 18, 'ç': 19, 'c': 20, 'h': 21}
List as tokenizer input:
from keras.preprocessing.text import Tokenizer
text = ["Keras is a deep learning and neural networks API by François Chollet"]
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text) #input of tokenizer as list
print(tokenizer.word_index)
>>> {'keras': 1, 'is': 2, 'a': 3, 'deep': 4, 'learning': 5, 'and': 6, 'neural': 7, 'networks': 8,
'api': 9, 'by': 10, 'françois': 11, 'chollet': 12}
The docs say to pass a list of strings or a list of lists of strings. They do not mention whether a plain string is allowed as input, so what you're doing may be undefined behaviour that isn't being caught.
When you pass a string as input, Keras appears to treat it as a request for character-level tokenization. Either way, if you want character-level tokenization, it's much better to pass char_level=True when instantiating the Tokenizer class.
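One plausible explanation for the character-level result is that the tokenizer simply loops over whatever it is given, and looping over a Python string yields one character at a time. The sketch below is a toy stand-in written for illustration only: `build_word_index` is a hypothetical helper, not Keras code, and it ignores punctuation filtering and the other options the real Tokenizer has.

```python
from collections import Counter


def build_word_index(texts):
    """Toy stand-in for fitting a tokenizer (hypothetical, not the Keras implementation)."""
    counts = Counter()
    for text in texts:  # iterating over a str yields single characters
        counts.update(text.lower().split())  # a lone space splits to nothing
    # Most frequent token gets index 1; ties keep first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {token: index for index, (token, _) in enumerate(ranked, start=1)}


sentence = "Keras is a deep learning API"
from_string = build_word_index(sentence)   # each "document" is one character
from_list = build_word_index([sentence])   # one document, split into words

print(from_string)
print(from_list)
```

Under this assumption, the string case never sees a whole word: every "document" handed to the splitting step is already a single character, which matches the behaviour described in the question.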
TL;DR: Don't pass a string. The docs don't list it as a valid argument, and there is a supported way to perform character-level tokenization.
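If the input may arrive as either a single sentence or a collection of sentences, a small guard can normalise it before fitting. This is only a sketch: `as_text_list` is a hypothetical helper name, not part of Keras.

```python
def as_text_list(texts):
    """Hypothetical helper: wrap a bare string so it counts as one document."""
    if isinstance(texts, str):
        return [texts]
    return list(texts)


print(as_text_list("Keras is a deep learning API"))
print(as_text_list(["first sentence", "second sentence"]))
```

With a guard like this, the call would become tokenizer.fit_on_texts(as_text_list(text)), so a stray string is tokenized as one sentence rather than character by character.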