
Bert Tokenizing error ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

I am using BERT for a text classification task. When I try to tokenize a single data sample with the code:

encoded_sent = tokenizer.encode(
                        sentences[7],                       
                        add_special_tokens = True)

it works fine, but whenever I try to tokenize the whole dataset with the code:

# For every sentence...
for sent in sentences:
    
    encoded_sent = tokenizer.encode(
                        sent,                       
                        add_special_tokens = True)

it gives me the error:

"ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."

I tried this on English data that someone else had tokenized successfully, and I get the same error. This is how I load my data:

import pandas as pd

df=pd.read_csv("/content/DATA.csv",header=0,dtype=str)
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
df.columns = [DATA_COLUMN, LABEL_COLUMN]

df["sentence"].head

And this is how I load the tokenizer:

# Load the BERT tokenizer.
from transformers import AutoTokenizer

print('Loading BERT tokenizer...')
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')

A sample of my data:

Original: مساعد نائب رئيس المنزل: لم نر حتى رسالة كومي حتى غردها جيسون تشافيتز

Tokenized: ['مساعد', 'نائب', 'رئيس', 'ال', '##منزل', ':', 'لم', 'نر', 'حتى', 'رسال', '##ة', 'كومي', 'حتى', 'غرد', '##ها', 'جيسون', 'تشافي', '##ت', '##ز']

Any suggestions, please?

It seems like your data contains NaN values. To get past this issue, you have to eliminate the NaN values or convert all of the data to strings (a local workaround).

Try using:

encoded_sent = tokenizer.encode(
        str(sent),   # cast to str so a NaN (a float) becomes the string 'nan' instead of raising
        add_special_tokens = True)
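
If you would rather eliminate the NaN rows at the pandas level instead of casting every sentence to a string, a minimal sketch (reusing the DataFrame and column names from the question) could look like this:

import pandas as pd

df = pd.read_csv("/content/DATA.csv", header=0, dtype=str)
df.columns = ['sentence', 'label']

# Drop rows whose sentence is missing (NaN) and reindex
df = df.dropna(subset=['sentence']).reset_index(drop=True)

sentences = df['sentence'].values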

If you're sure that the dataset doesn't contain NaN values, you might use that solution; or, to detect whether your dataset contains NaN values, you might use:

# Print each sentence before encoding so the offending (NaN) entry is visible
for sent in sentences:
    print(sent)
    encoded_sent = tokenizer.encode(sent, add_special_tokens = True)
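
Printing every sentence works, but it can be noisy on a large dataset. As an alternative sketch (assuming df is the DataFrame loaded above, with a 'sentence' column), pandas can report the NaN rows directly:

# Count and inspect NaN entries in the 'sentence' column
print(df['sentence'].isna().sum())    # number of NaN sentences
print(df[df['sentence'].isna()])      # show the offending rows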
