Comparing a list with tokenized sentences in Python when the closing quote has a trailing space
I have a dataset, and I extracted the data with a regex. I used the sent_tokenize method of NLTK to find the sentence boundaries for me:
tok = sent_tokenize(str(all_text))
print(tok[0])
It gives me this output:
# List of string
tok = ['Hi ' , hello at 'this ', there 'from ']
Now the annotated data that I have extracted from this dataset looks like this:
i = ['there' , 'hello', 'Hi']
If you look at the tok list, the opening quotation mark sits right against the word, while the closing quotation mark comes after a space. But in the i list, each element is enclosed in quotation marks without spaces. When I check whether any element of i is in tok, it should give me a result, but it cannot detect the text inside tok.
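A minimal reproduction of the mismatch, using hypothetical values modeled on the lists above (the trailing spaces inside the strings are the point):

```python
# Strings in tok carry trailing spaces, so exact membership tests fail
tok = ['Hi ', 'this ', 'from ']
i = ['there', 'hello', 'Hi']

# 'Hi' != 'Hi ' because of the trailing space, so nothing matches
matches = [w for w in i if w in tok]
print(matches)  # → []
```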
This should fix your problem:
tok = [j.strip() for j in tok]  # remove leading/trailing whitespace from each element
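After stripping, the same membership check finds the overlap (again using the sample values above):

```python
tok = ['Hi ', 'this ', 'from ']
i = ['there', 'hello', 'Hi']

tok = [j.strip() for j in tok]  # drop the surrounding whitespace
matches = [w for w in i if w in tok]
print(matches)  # → ['Hi']
```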
I'm not sure why sent_tokenize is tokenizing each word in the sentence for you, but if you want word tokens for each sentence, try something like this:
from nltk.tokenize import word_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

tokenizer = PunktSentenceTokenizer()
tokens = [word_tokenize(i) for i in tokenizer.tokenize(all_text)]
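For reference, the same two-level output shape (a list of sentences, each a list of word tokens) can be sketched without NLTK using a naive regex split; this only illustrates the structure the snippet above produces, not a replacement for the Punkt model:

```python
import re

def naive_sent_tokenize(text):
    # split on sentence-ending punctuation followed by whitespace
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

all_text = "Hi there. Hello from Python!"
tokens = [s.split() for s in naive_sent_tokenize(all_text)]
print(tokens)  # → [['Hi', 'there.'], ['Hello', 'from', 'Python!']]
```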