How to remove whitespace from start and end? Tokenizer(split='[.!?]')

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(split='[.!?]')  # tokenize on sentence-ending punctuation
tokenizer.fit_on_texts(df['Cleaned'].values)
X_data = tokenizer.texts_to_sequences(df['Cleaned'].values)
X_sequ = pad_sequences(X_data)

I got a list of tokens with whitespace, like below:

list(tokenizer.word_index)[:10]  # let's see the first 10 of our text sequences

The output looks like this:

[' rnfbdhl yis',
 ' oromoon bilisoomsiteeti jirti',
 ' namni oromummaa isaatiin mataa gadi qabtee deemu hin jiru yeroo ammaa tanatti']

How can I remove the whitespace from the start and end automatically? Please help.

You can try a regular expression using re in Python. Here is how it works: the caret (^) anchors the beginning of a string, the dollar sign ($) anchors the end, and \s+ matches one or more whitespace characters. So the regular expression means: replace every run of whitespace at the beginning or end of a string with '' (nothing). Let me know if it has worked for you.

import re

# strip leading and trailing whitespace from every token
[re.sub(r'^\s+|\s+$', '', item) for item in list(tokenizer.word_index)]
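
If you need the token-to-index mapping rather than just the cleaned keys, the same substitution can rebuild word_index as a dictionary. This is a hypothetical extension, not part of the original answer; cleaned_index is an illustrative name:

import re

# hypothetical extension: rebuild the token -> index mapping with stripped keys
cleaned_index = {re.sub(r'^\s+|\s+$', '', token): idx
                 for token, idx in tokenizer.word_index.items()}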

With the given list of strings:

x = [' rnfbdhl yis', ' oromoon bilisoomsiteeti jirti', ' namni oromummaa isaatiin mataa gadi qabtee deemu hin jiru yeroo ammaa tanatti']

Output:

[re.sub(r'^\s+|\s+$', '', item) for item in x]
['rnfbdhl yis', 'oromoon bilisoomsiteeti jirti', 'namni oromummaa isaatiin mataa gadi qabtee deemu hin jiru yeroo ammaa tanatti']
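
As an aside (not in the original answer), Python's built-in str.strip() does the same job for plain leading and trailing whitespace, with no regular expression needed:

# equivalent result using the built-in strip()
[item.strip() for item in x]
# ['rnfbdhl yis', 'oromoon bilisoomsiteeti jirti', 'namni oromummaa isaatiin mataa gadi qabtee deemu hin jiru yeroo ammaa tanatti']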
