
Parse nltk chunk string to form Tree

I have a file containing strings like this:

Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), Tree('NP', 
[('new', 'JJ'), ('york', 'NN')])])

Is there a Python function that parses such a string and produces the Tree structure again? I tried Tree.fromstring, but it fails to parse the string.

I generate these strings as follows:

>>> import nltk
>>> from nltk import pos_tag
>>> pattern = """NP: {<DT>?<JJ>*<NN>}
... VBD: {<VBD>}
... IN: {<IN>}"""
>>> NPChunker = nltk.RegexpParser(pattern)
>>> sentence = 'criminal lawyer new york'.split()
>>> pos_tag(sentence)
[('criminal', 'JJ'), ('lawyer', 'NN'), ('new', 'JJ'), ('york', 'NN')]
>>> result = NPChunker.parse(pos_tag(sentence))
>>> result
Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), Tree('NP', 
[('new', 'JJ'), ('york', 'NN')])])

Thanks in advance.

When you do

>>> result = NPChunker.parse(pos_tag(sentence))
>>> result
Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), Tree('NP', [('new', 'JJ'), ('york', 'NN')])])

you are seeing a string representation of the data structure in memory.

When you type result at the interpreter prompt, what you get is the same as what you get if you type repr(result). It appears that you have saved this repr() string in a file. That is unfortunate, because Tree.fromstring() does not accept this format.

To save a version that Tree.fromstring() can read, write out the str() (not the repr()) of the tree. You can see the difference here:

>>> result
Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), Tree('NP', [('new', 'JJ'), ('york', 'NN')])])
>>> str(result)
'(S (NP criminal/JJ lawyer/NN) (NP new/JJ york/NN))'

Tree.fromstring() is expecting the second of these formats.

To verify that this will do what you want:

>>> result2 = nltk.Tree.fromstring(str(result))
>>> result2
Tree('S', [Tree('NP', ['criminal/JJ', 'lawyer/NN']), Tree('NP', ['new/JJ', 'york/NN'])])
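Notice that the leaves come back as plain strings such as 'criminal/JJ' rather than (word, tag) tuples. If you want the tuples back, one option is to pass nltk.tag.str2tuple as the read_leaf argument of Tree.fromstring(). The following is a minimal sketch of saving and reloading trees that way. The helper names save_trees and load_trees and the file name chunks.txt are made up for illustration, and it assumes that your words contain no spaces or parentheses. Note also that str2tuple upper-cases the tag.

```python
import os
import tempfile

from nltk import Tree
from nltk.tag import str2tuple

# The same tree as in the question, built by hand so no tagger model is needed.
result = Tree('S', [
    Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]),
    Tree('NP', [('new', 'JJ'), ('york', 'NN')]),
])


def save_trees(trees, path):
    """Write one bracketed tree per line (hypothetical helper)."""
    with open(path, 'w', encoding='utf-8') as f:
        for tree in trees:
            # A very large margin stops pformat() from wrapping long trees,
            # so every tree stays on a single line.
            f.write(tree.pformat(margin=10**6) + '\n')


def load_trees(path):
    """Read the trees back, restoring (word, tag) tuples (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        # read_leaf=str2tuple turns 'criminal/JJ' back into ('criminal', 'JJ').
        return [Tree.fromstring(line, read_leaf=str2tuple)
                for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), 'chunks.txt')
save_trees([result], path)
restored = load_trees(path)[0]
```

With this, restored should compare equal to result, tuples included.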

But that is for the future. You need to repair the file you have. Do the following:

>>> from nltk import Tree
>>> input_string = "Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), Tree('NP', [('new', 'JJ'), ('york', 'NN')])])"

I'm doing an inline assignment here, but of course you will be reading input_string from a text file.

>>> parsed_tree = eval(input_string)
>>> type(parsed_tree)
<class 'nltk.tree.Tree'>
>>> str(parsed_tree)
'(S (NP criminal/JJ lawyer/NN) (NP new/JJ york/NN))'

This solution is suitable only as a one-time emergency repair of a file you created yourself. Don't make it a regular procedure: eval() executes whatever Python code the string contains.
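If you would rather not call eval() at all, even for the repair, the repr() can be parsed with the standard ast module and the Tree rebuilt by hand. This is only a sketch and not part of NLTK: the function name tree_from_repr is made up, and it assumes that every node is written exactly as Tree(label, [children]) and that every leaf is a plain Python literal such as ('criminal', 'JJ').

```python
import ast

from nltk import Tree


def tree_from_repr(text):
    """Rebuild a Tree from its repr() without executing the text (hypothetical helper)."""

    def build(node):
        # A node of the form Tree('LABEL', [child, child, ...])
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == 'Tree'
                and len(node.args) == 2
                and not node.keywords):
            label = ast.literal_eval(node.args[0])
            children = node.args[1]
            if not isinstance(children, ast.List):
                raise ValueError('expected a list of children')
            return Tree(label, [build(child) for child in children.elts])
        # Anything else must be a plain literal; literal_eval() raises
        # ValueError for function calls, names, and other non-literal code.
        return ast.literal_eval(node)

    # mode='eval' parses a single expression, which may span several lines.
    return build(ast.parse(text.strip(), mode='eval').body)


input_string = ("Tree('S', [Tree('NP', [('criminal', 'JJ'), ('lawyer', 'NN')]), "
                "Tree('NP', [('new', 'JJ'), ('york', 'NN')])])")
parsed_tree = tree_from_repr(input_string)
```

Because the text is only parsed and never executed, a line containing anything other than Tree(...) calls and literals is rejected with a ValueError instead of being run.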
