Splitting a list of sentences into separate words in a list
I have a list of lines:
lines = ['The query complexity of estimating weighted averages.',
'New bounds for the query complexity of an algorithm that learns',
'DFAs with correction equivalence queries.',
'general procedure to check conjunctive query containment.']
I need to store it as a list of separate words:
lines = ['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.',
'New', ......]
How can I obtain it as a list of separate words?
You can use a list comprehension:
>>> lines = [
... 'The query complexity of estimating weighted averages.',
... 'New bounds for the query complexity of an algorithm that learns',
... ]
>>> [word for line in lines for word in line.split()]
['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns']
You can join all lines and then use split():
" ".join(lines).split()
Or you can split each line and chain the results:
from itertools import chain
list(chain(*map(str.split, lines)))
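As a quick sanity check, the three approaches above can be run side by side. This is a minimal sketch on a shortened input; it uses chain.from_iterable in place of chain(*...), which is a substitution made here (not part of the original answer) to avoid unpacking the whole map into arguments.

```python
from itertools import chain

lines = [
    'The query complexity of estimating weighted averages.',
    'New bounds for the query complexity of an algorithm that learns',
]

# 1. Nested list comprehension: split each line, flatten as we go
by_comprehension = [word for line in lines for word in line.split()]

# 2. Join all lines with a space, then split once on whitespace
by_join = " ".join(lines).split()

# 3. Split each line lazily and flatten the resulting lists
by_chain = list(chain.from_iterable(map(str.split, lines)))

# All three produce the same flat list of words
assert by_comprehension == by_join == by_chain
print(by_chain[:3])  # ['The', 'query', 'complexity']
```

Note that all three keep punctuation attached to the word it follows (for example 'averages.'), because str.split() only splits on whitespace.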
You can do it with NLTK's word_tokenize:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
lines = ['The query complexity of estimating weighted averages.',
'New bounds for the query complexity of an algorithm that learns',
'DFAs with correction equivalence queries.',
'general procedure to check conjunctive query containment.']
joint_words = ' '.join(lines)
separated_words = word_tokenize(joint_words)
print(separated_words)
The output will be:
['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages', '.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries', '.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment', '.']
In addition, if you want to merge each period (which appears as a separate string in the list) into the preceding word, run the following code:
merged_words = []
for token in separated_words:
    # Attach a standalone '.' to the previous word instead of keeping it separate
    if token == '.' and merged_words:
        merged_words[-1] += token
    else:
        merged_words.append(token)
separated_words = merged_words  # Built as a new list, so nothing is deleted while iterating
print(separated_words)
The output will be:
['The', 'query', 'complexity', 'of', 'estimating', 'weighted', 'averages.', 'New', 'bounds', 'for', 'the', 'query', 'complexity', 'of', 'an', 'algorithm', 'that', 'learns', 'DFAs', 'with', 'correction', 'equivalence', 'queries.', 'general', 'procedure', 'to', 'check', 'conjunctive', 'query', 'containment.']
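If installing NLTK and downloading its tokenizer data is not an option, the tokenization step can be roughly approximated with the standard library. The sketch below is an assumption-laden stand-in, not a replacement for word_tokenize: the pattern simply treats runs of word characters as words and every other non-space character as its own token, so it will differ from NLTK on contractions, abbreviations, and similar cases.

```python
import re

lines = ['The query complexity of estimating weighted averages.',
         'DFAs with correction equivalence queries.']

text = ' '.join(lines)

# \w+      -> a run of letters, digits, or underscores (a "word")
# [^\w\s]  -> any single character that is neither a word character nor whitespace
tokens = re.findall(r"\w+|[^\w\s]", text)

print(tokens)
```

For this input the periods come out as separate '.' tokens, so the merging loop shown earlier applies unchanged.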