Python :: NLTK concatenating list of sentences

Question

NLTK (http://www.nltk.org/) is a toolkit for computational linguistics.

I am trying to manipulate sentences using the sents() method of the gutenberg corpus reader:

from nltk.corpus import gutenberg

It fetches a text by fileid:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')

The output is:

print(hamlet)
[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]

But say I want a list of sentences by author instead of by book. I tried building it up one book at a time (it won't let me extend() the lists):

shakespeare = []

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare.append(hamlet)
shakespeare.append(macbeth)
shakespeare.append(caesar)

But then it all becomes nested:

print(shakespeare)

[[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ...], [['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]]

Is there a way I can end up with ONE list with all concatenated sentences, not nested, like this?

[['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'], ['Actus', 'Primus', '.'], ..., ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'], ['Actus', 'Primus', '.'], ...]

Answer 1

The best solution is to fetch them all at once; the sentences then come out the way you want them. NLTK's corpus readers accept either a single filename or a list of filenames:

shakespeare = gutenberg.sents(['shakespeare-hamlet.txt',
                 'shakespeare-macbeth.txt', 'shakespeare-caesar.txt'])

In other situations, if you have several lists and want to concatenate them, use extend(), not append():

shakespeare = []
shakespeare.extend(hamlet)
shakespeare.extend(macbeth)
shakespeare.extend(caesar)
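
To see why append() nests and extend() does not, here is a minimal sketch. It uses small hand-written lists as stand-ins for the corpus sentences; the contents are made-up placeholders, not actual NLTK output.

```python
# Placeholder data standing in for tokenized sentences (not real corpus output).
hamlet = [['Hamlet', 'sentence', '1'], ['Hamlet', 'sentence', '2']]
macbeth = [['Macbeth', 'sentence', '1']]

nested = []
nested.append(hamlet)   # adds the whole book as ONE element
nested.append(macbeth)

flat = []
flat.extend(hamlet)     # adds each sentence of the book individually
flat.extend(macbeth)

print(len(nested))  # 2 (one entry per book)
print(len(flat))    # 3 (one entry per sentence)
```

extend() accepts any iterable, so the same pattern should apply when the right-hand side is not a plain list.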

Answer 2

I agree with Alexis that the ideal is to fetch them all at once from the gutenberg corpus. For anyone in the future looking to concatenate sentences from separate corpora, you could also try this Pythonic approach:

hamlet = gutenberg.sents('shakespeare-hamlet.txt')
macbeth = gutenberg.sents('shakespeare-macbeth.txt')
caesar = gutenberg.sents('shakespeare-caesar.txt')

shakespeare = hamlet + macbeth + caesar
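
A small sketch of the same idea with plain lists standing in for the corpus objects (the data below is made up for illustration). The objects returned by sents() are lazy sequences rather than plain lists, so wrapping each one in list() first is a conservative choice if you need an ordinary list afterwards; that step is my assumption, not part of the approach above.

```python
# Placeholder data standing in for tokenized sentences (not real corpus output).
hamlet = [['Hamlet', 'sentence', '1'], ['Hamlet', 'sentence', '2']]
macbeth = [['Macbeth', 'sentence', '1']]
caesar = [['Caesar', 'sentence', '1']]

# list() copies a plain list; for a lazy sequence it would force a real list.
shakespeare = list(hamlet) + list(macbeth) + list(caesar)

print(len(shakespeare))  # 4 (sentences from all three books, one level deep)
print(len(hamlet))       # 2 (the operands are left unchanged)
```

Unlike extend(), + builds a new sequence and does not modify any of the inputs.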

Answer 3

You can use itertools.chain to flatten your list shakespeare by one level after appending the books to it:

from itertools import chain

lis = list(chain.from_iterable(shakespeare))

# output:
# [
#   ['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Macbeth', 'by', 'William', 'Shakespeare', '1603', ']'],
#   ['Actus', 'Primus', '.'],
#   ['[', 'The', 'Tragedie', 'of', 'Julius', 'Caesar', 'by', 'William', 'Shakespeare', '1599', ']'],
#   ['Actus', 'Primus', '.']
# ]

You could also opt for a list comprehension with a double loop:

lis = [y for x in shakespeare for y in x]
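
As a quick check that the two approaches are equivalent, here is a minimal sketch on made-up placeholder data (a list of "books", each a list of tokenized sentences). Both remove exactly one level of nesting, so the sentences themselves stay intact as token lists.

```python
from itertools import chain

# Placeholder data: each inner list of lists plays the role of one book.
shakespeare = [
    [['Hamlet', 's1'], ['Hamlet', 's2']],
    [['Macbeth', 's1']],
    [['Caesar', 's1'], ['Caesar', 's2']],
]

via_chain = list(chain.from_iterable(shakespeare))
via_comprehension = [sentence for book in shakespeare for sentence in book]

print(via_chain == via_comprehension)  # True
print(via_chain[0])                    # ['Hamlet', 's1']
print(len(via_chain))                  # 5
```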

3 answers

solution1
2 2016-06-12 00:15:48

solution2
1 2018-08-04 03:55:05

solution3
0 ACCEPTED 2016-06-08 04:36:35
