How to detokenize words back to the original form in a list in Python
import string
from nltk.tokenize import word_tokenize

libOfSentences = ["Get help with the display",
                  "Display is not working properly", "I need some help"]
# removing stopwords (stop_words is assumed to be defined earlier)
for i in libOfSentences:
    sentence = word_tokenize(i)  # tokenize each individual word
    sentence = filter(lambda x: x not in string.punctuation, sentence)
    cleaned_text = filter(lambda x: x not in stop_words, sentence)
    removedStopwordsList = " ".join(cleaned_text)
removedStopwordsList has now joined the sentences back together, but I want to keep it in a list. The desired output is like this:
["Get help display", "Display not working properly", "I need some help"]
I want to have removedStopwordsList still be a list I can loop through. For example, removedStopwordsList[0] gives me "G D I" right now, but I want removedStopwordsList[0] to output "Get help display". The join function is what is stopping this from occurring right now, but I can't find a better workaround.
"I want to have removedStopwordsList still be a list"
Then just make it a list instead of making it a string:
removedStopwordsList = list(cleaned_text)
Although you can do this even more simply by using a list comprehension instead of calling filter:
removedStopwordsList = [x for x in sentence if x not in stop_words]
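As a small illustration of why the two forms are interchangeable: in Python 3, filter returns a lazy iterator, so it has to be passed to list() before it can be indexed, while a list comprehension gives you a list directly. The word list and stop_words set below are made-up stand-ins for this sketch, not NLTK's actual data.

```python
# Hypothetical stand-ins for a tokenized sentence and a stop-word set.
sentence = ["Get", "help", "with", "the", "display"]
stop_words = {"with", "the"}

# filter() returns a lazy iterator in Python 3; list() materializes it.
via_filter = list(filter(lambda x: x not in stop_words, sentence))

# The list comprehension builds the same list without a lambda.
via_comprehension = [x for x in sentence if x not in stop_words]

print(via_filter)         # ['Get', 'help', 'display']
print(via_comprehension)  # ['Get', 'help', 'display']
```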
map and filter are great when you have a function you want to call on each element, but when you have an arbitrary expression, which you have to wrap up in lambda to turn into a function call, it's simpler and more readable to just use a list comprehension or generator expression.
And you can similarly simplify the previous line. So:
for i in libOfSentences:
    sentence = word_tokenize(i)  # tokenize each individual word
    # generator expression: punctuation tokens are skipped lazily
    sentence = (x for x in sentence if x not in string.punctuation)
    removedStopwordsList = [x for x in sentence if x not in stop_words]
If you need to have the joined-up string around as well, that's fine; you can have a second variable:
removedStopwordsString = " ".join(removedStopwordsList)
If you really want a single object that can behave both ways, it wouldn't be hard to write such a class, but it would just be ugly. And under the covers, it's just going to have a self.list_of_words and self.joined_string that it delegates to anyway. So, what would be the point?
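For the curious, here is roughly what such a class might look like. This is only a sketch: the class name WordList and the choice of which methods to delegate are assumptions made for illustration; only the two attribute names come from the text above.

```python
class WordList:
    """Hypothetical wrapper holding both a word list and its joined string."""

    def __init__(self, words):
        self.list_of_words = list(words)
        self.joined_string = " ".join(self.list_of_words)

    # List-like behavior is delegated to the list.
    def __getitem__(self, index):
        return self.list_of_words[index]

    def __len__(self):
        return len(self.list_of_words)

    def __iter__(self):
        return iter(self.list_of_words)

    # String-like behavior is delegated to the joined string.
    def __str__(self):
        return self.joined_string


words = WordList(["Get", "help", "display"])
print(words[0])    # Get
print(str(words))  # Get help display
```

Every further list or string method you wanted would need its own delegating method, which is where the ugliness comes from.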
At any rate, I doubt you need to keep the string around. If you ever want to print it out, you can just join it on the fly:
print(" ".join(removedStopwordsList))
… or even expand it into separate printables:
print(*removedStopwordsList)
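A quick check that the two print forms write the same text, using a hard-coded example list (an assumption for this sketch) and io.StringIO to capture the output. It works because print separates its arguments with a single space by default.

```python
import io

removedStopwordsList = ["Get", "help", "display"]  # example data

# Capture what each print call writes so the two can be compared.
joined_out = io.StringIO()
print(" ".join(removedStopwordsList), file=joined_out)

unpacked_out = io.StringIO()
print(*removedStopwordsList, file=unpacked_out)  # default sep is one space

print(joined_out.getvalue() == unpacked_out.getvalue())  # True

# The unpacked form also lets you pick a different separator.
print(*removedStopwordsList, sep=", ")  # Get, help, display
```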
If you're trying to gather all of those lists into one big list, you have to actually write code to do that. Obviously if you do removedStopwordsList = <anything> each time through the loop, you're just replacing it each time through. You need to append that to some bigger list if you want to keep all those lists around. For example:
listOfLists = []
for i in libOfSentences:
    sentence = word_tokenize(i)  # tokenize each individual word
    sentence = (x for x in sentence if x not in string.punctuation)
    removedStopwordsList = [x for x in sentence if x not in stop_words]
    listOfLists.append(removedStopwordsList)  # keep every sentence's words
And now, if you print out listOfLists, it'll be a list of three lists of words, one per sentence; listOfLists[0] will be the first list; listOfLists[0][0] will be the first word of the first list; etc.
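Putting it together, here is a self-contained sketch that also produces the list of joined strings asked for in the question. To keep it runnable without NLTK, it makes two assumptions: a whitespace-splitting tokenize function stands in for word_tokenize, and stop_words is a tiny hand-picked set rather than NLTK's English list. With the real tokenizer and stop-word list, the results may differ.

```python
import string

libOfSentences = ["Get help with the display",
                  "Display is not working properly", "I need some help"]

# Assumption: a tiny hand-picked stop-word set, not NLTK's English list.
stop_words = {"with", "the", "is"}


def tokenize(text):
    # Assumption: whitespace splitting as a stand-in for word_tokenize.
    return text.split()


listOfLists = []
for i in libOfSentences:
    sentence = (x for x in tokenize(i) if x not in string.punctuation)
    listOfLists.append([x for x in sentence if x not in stop_words])

# One joined string per sentence, still kept in a list.
listOfStrings = [" ".join(words) for words in listOfLists]

print(listOfLists[0])     # ['Get', 'help', 'display']
print(listOfLists[0][0])  # Get
print(listOfStrings[0])   # Get help display
```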