BertTokenizer error ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers
I am using the BERT tokenizer for French and I get this error, which I cannot seem to resolve. Any suggestions are welcome.
Traceback (most recent call last):
  File "training_cross_data_2.py", line 240, in <module>
    training_data(f, root, testdir, dict_unc)
  File "training_cross_data_2.py", line 107, in training_data
    Xtrain_emb, mdlname = get_flaubert_layer(data)
  File "training_cross_data_2.py", line 40, in get_flaubert_layer
    tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "training_cross_data_2.py", line 40, in <lambda>
    tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 907, in encode
    **kwargs,
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1021, in encode_plus
    first_ids = get_input_ids(text)
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1003, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
I looked around for an answer, but nothing that was suggested seems to work. texte is a dataframe.
Here is the code:
import os
import numpy as np
from transformers import FlaubertModel, FlaubertTokenizer

def get_flaubert_layer(texte):  # texte is a dataframe which I read from an Excel file
    language_model_dir = os.path.expanduser(args.language_model_dir)
    lge_size = language_model_dir[16:-1]  # modify when on Jean Zay: 27:-1
    print(lge_size)
    flaubert = FlaubertModel.from_pretrained(language_model_dir)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(language_model_dir)
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True))
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    attention_mask = np.where(padded != 0, 1, 0)
I have another file with the same structure and it works fine, but for this one I do not understand why I get this error. Should I re-download the model?
The file looks like this:
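For context, the nan in the question title is the usual trigger for this ValueError: pandas reads empty Excel cells as float NaN, which is neither a string nor a list of strings, so encode rejects it. A minimal diagnostic sketch, assuming texte is the pandas Series being tokenized:

# Diagnostic sketch: list the rows that are not strings.
# Empty Excel cells come back from pandas as float NaN and will make
# flaubert_tokenizer.encode raise exactly this ValueError.
non_strings = texte[texte.apply(lambda x: not isinstance(x, str))]
print(non_strings)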
You may want to change this line:
tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
to:
tokenized = flaubert_tokenizer.encode(texte["verbatim"],
                                      add_special_tokens=True,
                                      max_length=512,
                                      truncation=True)
This has two advantages: you drop the per-row apply, and the encode function is called only once for the whole column. This might speed up the tokenization.
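Since the nan in the error points at empty cells rather than a broken model, re-downloading should not be necessary; cleaning the column first is the safer fix. A minimal sketch combining the cleanup with batch tokenization, assuming the column is named "verbatim" as above and a transformers version (3.x or later) whose tokenizer call accepts padding and return_tensors:

# Sketch: replace NaN cells with empty strings, then tokenize the whole
# column in one call; padding and the attention mask come back for free.
texts = texte["verbatim"].fillna("").astype(str).tolist()
encoded = flaubert_tokenizer(texts,
                             add_special_tokens=True,
                             max_length=512,
                             truncation=True,
                             padding=True,            # pad every row to the longest in the batch
                             return_tensors="np")     # numpy arrays instead of Python lists
padded = encoded["input_ids"]               # shape (n_rows, max_len), replaces the manual loop
attention_mask = encoded["attention_mask"]  # 1 for real tokens, 0 for padding

This reproduces the padded and attention_mask arrays from the question without the hand-written padding loop.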