
NLTK: conllstr2tree does not work properly (Python3)

The example that illustrates what I'm trying to do is in section 3.1 of http://www.nltk.org/book/ch07.html

Here is essentially what it does:

import nltk
text = " ..... "  #Whatever the text should be
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

This generates the tree based on the text given.
The code I have written takes its input from a text file instead. After opening the file, I use read() to get its contents as a string.

import nltk, re, pprint
f = open('sample.txt', 'r')
f1 = f.read().strip()
f2 = ' '.join(f1.split())
nltk.chunk.conllstr2tree(f2, chunk_types=['NP']).draw()

The error I'm getting is:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-768af8cd2f77> in <module>()
      3 f1 = f.read().strip()
      4 f2 = ' '.join(f1.split())
----> 5 nltk.chunk.conllstr2tree(f2, chunk_types=['NP']).draw()

/usr/local/lib/python3.4/dist-packages/nltk/chunk/util.py in conllstr2tree(s, chunk_types, root_label)
    380         match = _LINE_RE.match(line)
    381         if match is None:
--> 382             raise ValueError('Error on line %d' % lineno)
    383         (word, tag, state, chunk_type) = match.groups()
    384 

ValueError: Error on line 0

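You are reading the raw string data from sample.txt, stripping the leading and trailing whitespace (f1), and then splitting on whitespace and re-joining with single spaces (f2). That last step collapses every line break, so the whole file becomes one long line.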
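To make the effect of those two steps concrete, here is a minimal sketch that applies them to a small string instead of a file. The file contents are a made-up assumption; only the strip/split/join calls come from the question.

```python
# Hypothetical file contents: one "token POS chunk-tag" triple per line.
file_contents = "he PRP B-NP\naccepted VBD B-VP\nthe DT B-NP\n"

stripped = file_contents.strip()         # same operation as f1 in the question
collapsed = " ".join(stripped.split())   # same operation as f2 in the question

print(len(stripped.split("\n")))    # 3 lines before collapsing
print(len(collapsed.split("\n")))   # 1 line after collapsing
print(repr(collapsed))              # 'he PRP B-NP accepted VBD B-VP the DT B-NP'
```

Whatever the file contains, anything that relies on one item per line can no longer see the line boundaries after the join.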

If you look at the example from the NLTK book where they mention the chunking method,

nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()

the text variable is a sequence of IOB tagged data like so:

text = """
   he PRP B-NP
   accepted VBD B-VP
   the DT B-NP
   position NN I-NP
   of IN B-PP
   vice NN B-NP
   chairman NN I-NP
   of IN B-PP
   Carlyle NNP B-NP
   Group NNP I-NP
   , , O
   a DT B-NP
   merchant NN I-NP
   banking NN I-NP
   concern NN I-NP
   . . O
"""

According to the source code documentation, the conllstr2tree method does the following:

Return a chunk structure for a single sentence encoded in the given CONLL 2000 style string. This function converts a CoNLL IOB string into a tree. It uses the specified chunk types (defaults to NP, PP and VP), and creates a tree rooted at a node labeled S (by default).

The problem is that you are simply not passing in data in the correct format (CoNLL 2000, Wall Street Journal), which has one token per line and should look like so (without the slashes):

token / POS Tag / IOB-Chunk Type
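A quick way to see whether a string is even close to this format is to check it line by line before handing it to conllstr2tree. The sketch below uses only the standard library; find_bad_lines is a hypothetical helper, and its check is a simplified assumption about the format, not the pattern NLTK itself uses.

```python
def find_bad_lines(text):
    """Return (line_number, line) pairs that do not look like 'token POS IOB-tag'."""
    bad = []
    for lineno, line in enumerate(text.split("\n")):
        if not line.strip():
            continue  # blank lines are skipped in this sketch
        fields = line.split()
        # Expect exactly three fields, the last one being O, B-<type> or I-<type>.
        looks_ok = len(fields) == 3 and (
            fields[2] == "O" or fields[2][:2] in ("B-", "I-")
        )
        if not looks_ok:
            bad.append((lineno, line))
    return bad


conll_text = "he PRP B-NP\naccepted VBD B-VP\n. . O"
raw_text = "he accepted the position"   # plain prose, as a raw text file might contain

print(find_bad_lines(conll_text))  # []
print(find_bad_lines(raw_text))    # [(0, 'he accepted the position')]
```

Plain prose fails on the very first line (line 0), which lines up with the "Error on line 0" message in the traceback.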

So you will need a couple of extra steps:

  1. Find the likely Part-of-Speech Tag for each word.
  2. Find the chunk type.
  3. Prepend the appropriate IOB tag.

It would be unreasonable (for an SO question) to provide an example code snippet as this is quite a lot of work, but hopefully this points you in the right direction!
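For orientation only, steps 2 and 3 can be sketched on a toy scale. The sketch assumes step 1 has already been done and the (word, tag) pairs are supplied by hand; in practice a tagger such as nltk.pos_tag would produce them. The to_conll helper and its NP_TAGS rule are hypothetical and far cruder than a real chunker: they only mark runs of determiners, adjectives, nouns and pronouns as NP.

```python
# Toy rule (an assumption for illustration): these POS tags may be part of an NP chunk.
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP"}


def to_conll(tagged):
    """Turn (word, pos) pairs into 'token POS IOB-tag' lines, marking only NP chunks."""
    lines = []
    inside_np = False
    for word, pos in tagged:
        if pos in NP_TAGS:
            # A determiner or pronoun opens a new chunk; other tags continue an open one.
            starts_new = not inside_np or pos in {"DT", "PRP"}
            iob = "B-NP" if starts_new else "I-NP"
            inside_np = True
        else:
            iob = "O"
            inside_np = False
        lines.append(f"{word} {pos} {iob}")
    return "\n".join(lines)


# Step 1 is assumed done: hand-tagged tokens from the book's example sentence.
tagged = [("he", "PRP"), ("accepted", "VBD"), ("the", "DT"), ("position", "NN")]
conll_text = to_conll(tagged)
print(conll_text)
# he PRP B-NP
# accepted VBD O
# the DT B-NP
# position NN I-NP
```

A string built this way has one token per line, so it is in the shape that nltk.chunk.conllstr2tree(conll_text, chunk_types=['NP']) expects.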

