Retrieving sentence strings from an NLTK corpus

Question

This is my dataset:

emma=gutenberg.sents('austen-emma.txt')

It gives me sentences like this:

[[u'she',u'was',u'happy'],[u'It',u'was',u'her',u'own',u'good']]

But this is what I want to get:

['she was happy','It was her own good']

Answer 1

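According to the nltk docs, you seem to be getting the correct output:

sents(fileids=None) Returns: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings.

So you only need to turn each list of strings back into a space-separated sentence: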

sentences = [" ".join(list_of_words) for list_of_words in emma]
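As a minimal sketch of that join, here it is applied to a hand-written stand-in for the question's data. The list `emma_sample` is hypothetical and only mimics the shape that `gutenberg.sents()` returns, so the snippet runs without the corpus installed:

```python
# Hypothetical stand-in for gutenberg.sents(...): a list of sentences,
# each sentence being a list of word strings.
emma_sample = [
    ["she", "was", "happy"],
    ["It", "was", "her", "own", "good"],
]

# Join the words of each sentence with single spaces.
sentences = [" ".join(list_of_words) for list_of_words in emma_sample]
print(sentences)
# ['she was happy', 'It was her own good']
```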

Answer 2

Corpora accessed through the nltk.corpus API usually return a document stream, i.e. a list of sentences, each of which is a list of tokens.

>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> emma[0]
[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', u']']
>>> emma[1]
[u'VOLUME', u'I']
>>> emma[2]
[u'CHAPTER', u'I']
>>> emma[3]
[u'Emma', u'Woodhouse', u',', u'handsome', u',', u'clever', u',', u'and', u'rich', u',', u'with', u'a', u'comfortable', u'home', u'and', u'happy', u'disposition', u',', u'seemed', u'to', u'unite', u'some', u'of', u'the', u'best', u'blessings', u'of', u'existence', u';', u'and', u'had', u'lived', u'nearly', u'twenty', u'-', u'one', u'years', u'in', u'the', u'world', u'with', u'very', u'little', u'to', u'distress', u'or', u'vex', u'her', u'.']

For the nltk.corpus.gutenberg corpus, it loads a PlaintextCorpusReader; see https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L114 and https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py

So it is reading a directory of text files, one of which is 'austen-emma.txt', and it uses the default sent_tokenize and word_tokenize functions to process the corpus. In the code these are instantiated as tokenizers/punkt/english.pickle and WordPunctTokenizer(); see https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L40
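To see what this word tokenization does to a sentence, here is a rough sketch that uses only the standard library. It assumes the pattern `\w+|[^\w\s]+`, which is, as far as I know, the regular expression WordPunctTokenizer is built on. The helper name `word_punct_tokenize` is made up for this illustration and is not part of NLTK.

```python
import re

# Assumed pattern: runs of word characters, or runs of non-space punctuation.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokenize(text):
    """Hypothetical helper that mimics WordPunctTokenizer().tokenize(text)."""
    return WORD_PUNCT.findall(text)

tokens = word_punct_tokenize("nearly twenty-one years in the world.")
print(tokens)
# ['nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', '.']
```

Punctuation marks and hyphens come out as separate tokens, which matches the u'twenty', u'-', u'one' run in the emma[3] output above.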

So, to get the list of sentence strings you want, use:

>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.sents('austen-emma.txt')
>>> sents_list = [" ".join(sent) for sent in emma]
>>> sents_list[0]
u'[ Emma by Jane Austen 1816 ]'
>>> sents_list[1]
u'VOLUME I'
>>> sents_list[:1]
[u'[ Emma by Jane Austen 1816 ]']
>>> sents_list[:2]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I']
>>> sents_list[:3]
[u'[ Emma by Jane Austen 1816 ]', u'VOLUME I', u'CHAPTER I']

Answer 3

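As alvas and AShelly pointed out, what you are seeing is the correct behavior. However, their approach of simply joining the words of each sentence has two drawbacks:

You end up with whitespace around punctuation (e.g. "Emma Woodhouse , handsome , clever , and rich , with a comfortable [...]").
You have PlaintextCorpusReader tokenize each sentence into words only to undo that afterward, which is avoidable computational overhead.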
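The first drawback is easy to reproduce without loading the corpus. The sketch below joins a few of the tokens shown in the earlier answer and compares the result with the text as it should read; the variable names are only for illustration.

```python
# Tokens in the shape gutenberg.sents() returns them (start of emma[3]).
tokens = ["Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and", "rich"]

# The same fragment as ordinary prose.
original = "Emma Woodhouse, handsome, clever, and rich"

joined = " ".join(tokens)
print(joined)              # Emma Woodhouse , handsome , clever , and rich
print(joined == original)  # False
```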

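Given the implementation of PlaintextCorpusReader, it is easy to derive a function that takes exactly the same steps as PlaintextCorpusReader.sents() but skips the word tokenization of each sentence: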

def sentences_from_corpus(corpus, fileids=None):
    """Return the corpus sentences as plain strings, skipping word tokenization."""
    from nltk.corpus.reader.util import concat

    def read_sent_block(stream):
        # Split each paragraph into sentences and flatten wrapped lines.
        sents = []
        for para in corpus._para_block_reader(stream):
            sents.extend([s.replace('\n', ' ') for s in corpus._sent_tokenizer.tokenize(para)])
        return sents

    # One lazy corpus view per file, concatenated into a single sequence.
    return concat([corpus.CorpusView(path, read_sent_block, encoding=enc)
                   for (path, enc, fileid)
                   in corpus.abspaths(fileids, True, True)])

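Contrary to what I said above, this function does perform one additional step: since we no longer word-tokenize, we have to replace the newlines with spaces ourselves.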
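The newline handling matters because a sentence in a plain-text file can span several lines, and the sentence tokenizer hands it back with those line breaks intact. A small illustration with a hand-typed string (the position of the line break here is made up, not copied from the file):

```python
# A sentence as a sentence tokenizer might return it from wrapped text.
raw_sentence = (
    "She was the youngest of the two daughters\n"
    "of a most affectionate, indulgent father;"
)

# Same replacement as in read_sent_block above.
clean_sentence = raw_sentence.replace("\n", " ")
print(clean_sentence)
```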

Passing the gutenberg corpus to this function results in:

['[Emma by Jane Austen 1816]',
 'VOLUME I',
 'CHAPTER I',
 'Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition, seemed to unite some of the best blessings of existence; and had lived nearly twenty-one years in the world with very little to distress or vex her.',
 "She was the youngest of the two daughters of a most affectionate, indulgent father; and had, in consequence of her sister's marriage, been mistress of his house from a very early period.",
 ...]

3 answers

Answer 1
3 2015-05-11 14:50:29

Answer 2
2 2015-05-12 09:51:43

Answer 3
1 2017-07-10 18:04:29
