A list and a tokenized sentence separated by quotation `'` with space and without space in python

I have a dataset, and I extracted the data with a regex. I used NLTK's sent_tokenize method to define the sentence boundaries for me.

from nltk.tokenize import sent_tokenize

tok = sent_tokenize(str(all_text))
print(tok[0])
It gives me this output:


# List of strings
tok = ['Hi ', 'hello at this ', 'there from ']

Now the annotated data that I have extracted from this dataset looks like:

i = ['there' , 'hello', 'Hi']

If you look at the tok list, the opening quotation mark sits right against the word and the closing quotation mark comes after a space. But in the i list, each element is enclosed in quotation marks without any trailing space. When I check whether any element of i is in tok, it should give me a result, but it cannot find the text inside tok.
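
For illustration, here is a minimal sketch of the mismatch, assuming tok really holds sentence strings with trailing spaces as described (the data below is hypothetical and only mirrors the shapes shown above):

tok = ['Hi ', 'hello at this ', 'there from ']   # closing quote comes after a space
i = ['there', 'hello', 'Hi']

print('Hi' in tok)   # False: the list holds 'Hi ' (with a trailing space), not 'Hi'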

This should fix your problem:

tok = [j.strip() for j in tok]
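
A quick check of the effect, continuing the hypothetical tok and i lists from the question: after stripping, both an element-level and a word-level lookup find the annotated words.

tok = ['Hi ', 'hello at this ', 'there from ']
i = ['there', 'hello', 'Hi']

tok = [j.strip() for j in tok]                               # drop the trailing spaces
print('Hi' in tok)                                           # True after stripping
print([w for w in i if any(w in s.split() for s in tok)])    # ['there', 'hello', 'Hi']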

I'm not sure why sent_tokenize is tokenizing each word in the sentence for you, but if you want tokens for each sentence, try something like this:

from nltk import word_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Split into sentences first, then word-tokenize each sentence.
tokenizer = PunktSentenceTokenizer()
tokens = [word_tokenize(sent) for sent in tokenizer.tokenize(all_text)]
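
For reference, on a small made-up string the snippet above gives per-sentence word tokens along these lines (the sample text and the printed result are only illustrative, and word_tokenize needs the punkt data from nltk.download('punkt')):

sample = "Hi there. Hello from here."
print([word_tokenize(s) for s in PunktSentenceTokenizer().tokenize(sample)])
# roughly: [['Hi', 'there', '.'], ['Hello', 'from', 'here', '.']]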
