
Python :: NLTK concatenating list of sentences

NLTK (http://www.nltk.org/) is a toolkit for computational linguistics.

I am trying to manipulate sentences, using the sents() method:

from nltk.corpus import gutenberg

It fetches texts by fileid:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')

The output is:

print(hamlet)
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]

But let's say I want to make a list of sentences by author instead of by book. I tried building it up one book at a time (it won't let me extend() the lists):

shakespeare = []

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare.append(hamlet)
shakespeare.append(macbeth)
shakespeare.append(caesar)

But then it all becomes nested:

print(shakespeare)

[[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]]

Is there a way I can end up with ONE list with all concatenated sentences, not nested, like this?

[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]

The best solution is to fetch them all at once; the sentences then come back the way you want them. NLTK's corpus readers accept either a single filename or a list of filenames:

shakespeare = gutenberg.sents(['shakespeare-hamlet.txt',
                 'shakespeare-macbeth.txt', 'shakespeare-caesar.txt'])

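In other situations, if you have several lists and you want to concatenate them, you should use extend(), not append():

# start from an empty list and add each book's sentences in turn
shakespeare = []
shakespeare.extend(hamlet)
shakespeare.extend(macbeth)
shakespeare.extend(caesar)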
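To see the difference between append() and extend() without downloading the corpus, here is a minimal sketch. The two short lists are made-up stand-ins for what gutenberg.sents() returns (a sequence of tokenized sentences), not real corpus output.

```python
# Stand-in data (an assumption for illustration): each "book" is a
# list of tokenized sentences, like the output of gutenberg.sents().
hamlet = [['Actus', 'Primus', '.'], ['Who', "'s", 'there', '?']]
macbeth = [['When', 'shall', 'we', 'three', 'meet', 'againe', '?']]

nested = []
nested.append(hamlet)   # adds the whole book as ONE element
nested.append(macbeth)

flat = []
flat.extend(hamlet)     # adds each sentence of the book individually
flat.extend(macbeth)

print(len(nested))  # 2 (one entry per book)
print(len(flat))    # 3 (one entry per sentence)
```

extend() accepts any iterable, so it should also work when the argument is a lazy sequence rather than a plain list, as long as the object you call it on is a real list.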

I agree with Alexis that the ideal is to fetch them all at once from the gutenberg corpus. For anyone in the future looking to concatenate sentences from separate corpora, you could also try this Pythonic approach:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare = hamlet + macbeth + caesar

You can use itertools.chain to flatten your list shakespeare by one level after appending to it:

from itertools import chain

lis = list(chain.from_iterable(shakespeare))

# output:
# [
#   ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.']
# ]

You could also opt for a list comprehension with a double loop:

lis = [y for x in shakespeare for y in x]
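As a quick check that both flattening approaches give the same result, here is a small sketch. The nested list is made-up stand-in data in the same shape as the appended shakespeare list (a list of books, each a list of tokenized sentences); the variable names are only illustrative.

```python
from itertools import chain

# Stand-in for the nested list built with append(): books -> sentences -> tokens.
nested = [
    [['Actus', 'Primus', '.'], ['Who', "'s", 'there', '?']],
    [['When', 'shall', 'we', 'three', 'meet', 'againe', '?']],
]

# Both remove exactly one level of nesting, so the sentences stay intact.
via_chain = list(chain.from_iterable(nested))
via_comprehension = [sentence for book in nested for sentence in book]

print(via_chain == via_comprehension)  # True
print(len(via_chain))                  # 3
```

chain.from_iterable() is itself lazy, so if you only need to iterate over the sentences once you can skip the list() call and avoid building the full list in memory.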
