Python - difference between tagged_sents and tagged_words in NLTK corpora
What is the difference between tagged_sents and tagged_words in NLTK?
They both seem to be lists of (word, tag) tuples. If you call type() on them, both are
nltk.collections.LazySubsequence
From the docs:
Corpus reader functions are named based on the type of information they return.
Some common examples, and their return types, are:
- words(): list of str
- sents(): list of (list of str)
- paras(): list of (list of (list of str))
- tagged_words(): list of (str,str) tuple
- tagged_sents(): list of (list of (str,str))
- tagged_paras(): list of (list of (list of (str,str)))
- chunked_sents(): list of (Tree w/ (str,str) leaves)
- parsed_sents(): list of (Tree with str leaves)
- parsed_paras(): list of (list of (Tree with str leaves))
- xml(): A single xml ElementTree
- raw(): unprocessed corpus contents
>>> from nltk.corpus import brown
>>> brown.tagged_words()
[(u'The', u'AT'), (u'Fulton', u'NP-TL'), ...]
>>> len(brown.tagged_words()) # no. of words in the corpus.
1161192
>>> len(brown.tagged_sents()) # no. of sentences in the corpus.
57340
# Loop through the sentences and count the words in each sentence.
>>> sum(len(sent) for sent in brown.tagged_sents()) # no. of words in the corpus.
1161192
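The counts above suggest that tagged_words() is the same data as tagged_sents() with the sentence boundaries removed. A minimal sketch of that relationship, using a small hand-made list in place of the corpus (the name toy_tagged_sents and its contents are made up for illustration and are not taken from Brown), might look like this:

```python
from itertools import chain

# Hypothetical toy data shaped like the result of tagged_sents():
# a list of sentences, each sentence a list of (word, tag) tuples.
toy_tagged_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
    [("It", "PPS"), ("was", "BEDZ"), ("fair", "JJ"), (".", ".")],
]

# Flattening one level of nesting gives the shape of tagged_words():
# a single flat list of (word, tag) tuples.
toy_tagged_words = list(chain.from_iterable(toy_tagged_sents))

print(toy_tagged_sents[0])    # first element is a whole sentence (a list)
print(toy_tagged_words[0])    # first element is a single (word, tag) tuple
print(len(toy_tagged_sents), len(toy_tagged_words))
```

So iterating over tagged_sents() yields one list per sentence, while iterating over tagged_words() yields one tuple per token. Use the former when sentence boundaries matter (for example, when training a tagger) and the latter when only token-level counts are needed.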