The example which illustrates what I'm trying to do is at part 3.1 of http://www.nltk.org/book/ch07.html
Here is essentially what it is:
import nltk
text = " ..... " #Whatever the text should be
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
This generates the tree based on the given text.
The code I have written seeks to use the input from a text file. So after opening it, I use read to get a string version of it.
import nltk, re, pprint
f = open('sample.txt', 'r')
f1 = f.read().strip()
f2 = ' '.join(f1.split())
nltk.chunk.conllstr2tree(f2, chunk_types=['NP']).draw()
The error I'm getting is:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-768af8cd2f77> in <module>()
3 f1 = f.read().strip()
4 f2 = ' '.join(f1.split())
----> 5 nltk.chunk.conllstr2tree(f2, chunk_types=['NP']).draw()
/usr/local/lib/python3.4/dist-packages/nltk/chunk/util.py in conllstr2tree(s, chunk_types, root_label)
380 match = _LINE_RE.match(line)
381 if match is None:
--> 382 raise ValueError('Error on line %d' % lineno)
383 (word, tag, state, chunk_type) = match.groups()
384
ValueError: Error on line 0
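You are passing in raw string data from sample.txt, stripping the surrounding whitespace (f1), and then splitting on whitespace and re-joining the pieces with single spaces (f2). The result is one long line of text.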
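To make that concrete, here is a small sketch of what the strip/split/join combination does to a multi-line string. The contents of raw are a made-up stand-in for sample.txt (an assumption, since the actual file isn't shown); the point is only that every newline disappears.

```python
# Hypothetical stand-in for the contents of sample.txt (an assumption).
raw = "he PRP B-NP\naccepted VBD B-VP\nthe DT B-NP\n"

f1 = raw.strip()
# str.split() with no arguments splits on ANY whitespace, newlines included,
# so re-joining with " " flattens the text onto a single line.
f2 = " ".join(f1.split())

lines_before = f1.split("\n")
lines_after = f2.split("\n")

print(len(lines_before))  # 3
print(len(lines_after))   # 1
print(repr(f2))
```

So whatever the file contained, anything that expects one token per line only ever sees a single line.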
If you look at the example from the NLTK book where they mention the chunking method,
nltk.chunk.conllstr2tree(text, chunk_types=['NP']).draw()
the text variable is a sequence of IOB-tagged data like so:
text = """
he PRP B-NP
accepted VBD B-VP
the DT B-NP
position NN I-NP
of IN B-PP
vice NN B-NP
chairman NN I-NP
of IN B-PP
Carlyle NNP B-NP
Group NNP I-NP
, , O
a DT B-NP
merchant NN I-NP
banking NN I-NP
concern NN I-NP
. . O
"""
According to the source code documentation for the conllstr2tree method:
Return a chunk structure for a single sentence encoded in the given CONLL 2000 style string. This function converts a CoNLL IOB string into a tree. It uses the specified chunk types (defaults to NP, PP and VP), and creates a tree rooted at a node labeled S (by default).
The problem is that you are simply not passing in the correct format (CoNLL 2000 Wall Street Journal), which should look like so (without the slashes):
token / POS Tag / IOB-Chunk Type
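As a rough illustration of that layout, the sketch below joins (token, POS tag, IOB tag) triples into one-token-per-line text and flags lines that don't fit. The names to_conll, bad_lines and CONLL_LINE are hypothetical helpers written for this answer, and the regular expression is my own approximation of the format, not NLTK's internal pattern.

```python
import re

# Hypothetical checker (not NLTK's own code): three whitespace-separated
# fields, the last being either O or B-/I- followed by a chunk type.
CONLL_LINE = re.compile(r"^\S+\s+\S+\s+(?:O|[BI]-\S+)$")


def to_conll(triples):
    """Join (token, pos_tag, iob_tag) triples into one-token-per-line text."""
    return "\n".join(" ".join(triple) for triple in triples)


def bad_lines(conll_text):
    """Return the indices of non-blank lines that do not fit the pattern."""
    return [
        i
        for i, line in enumerate(conll_text.split("\n"))
        if line.strip() and not CONLL_LINE.match(line.strip())
    ]


triples = [("he", "PRP", "B-NP"), ("accepted", "VBD", "B-VP"), (",", ",", "O")]
text = to_conll(triples)
print(text)
print(bad_lines(text))                        # []
print(bad_lines("he accepted the position"))  # [0]
```

A string shaped like text here is the kind of input the book example hands to nltk.chunk.conllstr2tree; a line of plain prose is rejected at the very first line, which matches the "Error on line 0" in the traceback.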
So you will need a couple of extra steps:
It would be unreasonable (for an SO question) to provide an example code snippet as this is quite a lot of work, but hopefully this points you in the right direction!
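That said, a toy sketch may help show the shape of the work. Everything here is an assumption for illustration: the (word, POS) pairs are written by hand where a real pipeline would get them from a tagger such as nltk.pos_tag, and naive_np_iob is a deliberately crude, hypothetical rule for NP chunks only, not NLTK's chunker.

```python
# POS tags treated as part of a noun phrase (a simplifying assumption).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS", "PRP"}
# Tags that always open a new noun phrase in this toy rule.
NP_STARTERS = {"DT", "PRP"}


def naive_np_iob(tagged):
    """Attach a B-NP / I-NP / O tag to each (word, pos) pair."""
    rows = []
    inside = False  # True while the previous token was part of an NP
    for word, pos in tagged:
        if pos in NP_TAGS:
            iob = "I-NP" if inside and pos not in NP_STARTERS else "B-NP"
            inside = True
        else:
            iob = "O"
            inside = False
        rows.append((word, pos, iob))
    return rows


# Hand-tagged input standing in for real tagger output (an assumption).
tagged = [
    ("he", "PRP"), ("accepted", "VBD"), ("the", "DT"), ("position", "NN"),
    ("of", "IN"), ("vice", "NN"), ("chairman", "NN"),
]
rows = naive_np_iob(tagged)
conll = "\n".join(f"{word} {pos} {iob}" for word, pos, iob in rows)
print(conll)
```

A real solution would need a proper tagger and chunker instead of these hard-coded rules, which is where most of the effort goes.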