
Escape parentheses in NLTK parse tree

In NLTK we can convert a bracketed (parenthesized) tree string into an actual Tree object. However, when a token itself contains parentheses, the result is not what you would expect, because NLTK reads those parentheses as the start and end of a new node.

As an example, take the sentence

They like(d) it a lot

This could be parsed as

(S (NP (PRP They)) (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot))) (. .))

But if you parse this string into a tree with NLTK and print it, the (d) is clearly read as a new node, which is no surprise.

from nltk import Tree

s = '(S (NP (PRP They)) (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot))) (. .))'

tree = Tree.fromstring(s)
print(tree)

The result is

(S
  (NP (PRP They))
  (VP like (d ) (NP (PRP it)) (NP (DT a) (NN lot)))
  (. .))

So (d ) is a node inside the VP rather than part of the token like(d). Is there a way to escape parentheses in the tree parser?
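To see exactly how the string was read, it can help to inspect the VP's children directly rather than rely on the pretty-printed form. The following is a minimal sketch that assumes nothing beyond the example above; the variable name vp is only for illustration.

```python
from nltk import Tree

s = '(S (NP (PRP They)) (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot))) (. .))'
tree = Tree.fromstring(s)

vp = tree[1]            # second child of S is the VP subtree
print(vp[0])            # first child: the plain string leaf
print(repr(vp[1]))      # second child: a Tree labelled 'd' with no children
print(tree.leaves())    # the empty 'd' subtree contributes no leaf
```

Note that the d is lost from the leaves altogether, since it became a node label rather than part of a token.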

Initially I thought this was not possible, but halfway through writing my answer I found a solution. That solution is quite messy, though, so I have kept my original answer below, which I think is the slightly better option.

NLTK lets you supply custom regexes, so you can write a leaf pattern that matches escaped parentheses. The regex ([^\s\(\)\\]+(\\(?=\()\([^\s\(\)\\]+\\(?=\))\))*[\\]*)+ matches parentheses escaped with backslashes (\). The escaping backslashes remain in each leaf, however, so you also need a read_leaf function that strips them. The following code parses the example correctly:

from nltk import Tree

s = r'(S (NP (PRP They)) (VP like\(d\) (NP (PRP it)) (NP (DT a) (NN lot))) (. .))'  # raw string keeps the backslashes as written

tree = Tree.fromstring(
    s,
    leaf_pattern=r"([^\s\(\)\\]+(\\(?=\()\([^\s\(\)\\]+\\(?=\))\))*[\\]*)+",
    read_leaf=lambda x: x.replace("\\(", "(").replace("\\)", ")"),  # strip the escaping backslashes
)
print(tree)

And it outputs:

(S
  (NP (PRP They))
  (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot)))
  (. .))
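If you build the bracketed string yourself, the escaping and unescaping can be wrapped in small helpers. This is only a sketch: escape_token, unescape_leaf and LEAF_PATTERN are names made up for illustration, and the pattern is the same one as above, which I have only traced on tokens shaped like like(d) (text, then an escaped pair around more text).

```python
from nltk import Tree

# Same leaf pattern as above: plain characters, optionally followed by
# backslash-escaped parentheses wrapped around more plain characters.
LEAF_PATTERN = r"([^\s\(\)\\]+(\\(?=\()\([^\s\(\)\\]+\\(?=\))\))*[\\]*)+"


def escape_token(token):
    """Hypothetical helper: backslash-escape parentheses inside a token."""
    return token.replace("(", "\\(").replace(")", "\\)")


def unescape_leaf(leaf):
    """Hypothetical helper: undo escape_token on a parsed leaf."""
    return leaf.replace("\\(", "(").replace("\\)", ")")


verb = escape_token("like(d)")
s = f"(S (NP (PRP They)) (VP {verb} (NP (PRP it)) (NP (DT a) (NN lot))) (. .))"

tree = Tree.fromstring(s, leaf_pattern=LEAF_PATTERN, read_leaf=unescape_leaf)
print(tree.leaves())
```

Keeping the two helpers next to each other makes it harder for the escape and unescape steps to drift apart.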

Original answer

Perhaps you could ask nltk to match another bracket:

from nltk import Tree

s = '[S [NP [PRP They]] [VP like(d) [NP [PRP it]] [NP [DT a] [NN lot]]] [. .]]'

tree = Tree.fromstring(s, brackets='[]')
print(tree)

Which prints out:

(S
  (NP (PRP They))
  (VP like(d) (NP (PRP it)) (NP (DT a) (NN lot)))
  (. .))

You can print the tree with different brackets by using the pformat method (which is called internally when you call print):

print(tree.pformat(parens='[]'))

Which prints out:

[S
  [NP [PRP They]]
  [VP like(d) [NP [PRP it]] [NP [DT a] [NN lot]]]
  [. .]]
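Put together, pformat(parens='[]') and fromstring(..., brackets='[]') give a round trip for trees whose tokens contain parentheses. The sketch below builds the tree by hand and checks that it survives serialization. It assumes no token contains a square bracket or whitespace, since either would break this scheme in the same way parentheses break the default one.

```python
from nltk import Tree

# Build the tree directly, so the token 'like(d)' never has to be parsed
# out of a parenthesized string.
tree = Tree('S', [
    Tree('NP', [Tree('PRP', ['They'])]),
    Tree('VP', [
        'like(d)',
        Tree('NP', [Tree('PRP', ['it'])]),
        Tree('NP', [Tree('DT', ['a']), Tree('NN', ['lot'])]),
    ]),
    Tree('.', ['.']),
])

text = tree.pformat(parens='[]')                  # serialize with square brackets
restored = Tree.fromstring(text, brackets='[]')   # read it back the same way

print(text)
print(restored == tree)
```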

The traditional method is to convert parentheses into -LRB- and -RRB- within the parse. Most tools that work with Penn Treebank data support this escaping (NLTK, CoreNLP and many others).

NLTK supports this, but its default PTB-style tokenization assumes the parentheses are separate tokens rather than potentially token-internal:

from nltk.tokenize.treebank import TreebankWordTokenizer, TreebankWordDetokenizer

t = TreebankWordTokenizer()
d = TreebankWordDetokenizer()

s = "They like(d) it a lot."

tokens = t.tokenize(s, convert_parentheses=True)
print("Tokens:", tokens)
detokenized = d.detokenize(tokens, convert_parentheses=True)
print("Detokenized:", detokenized)

Output:

Tokens: ['They', 'like', '-LRB-', 'd', '-RRB-', 'it', 'a', 'lot', '.']
Detokenized: They like (d) it a lot.

If you convert the parentheses yourself in the input data, without inserting extra spaces, both the default tokenization and the detokenization with convert_parentheses=True keep the token in one piece:

s = 'They like-LRB-d-RRB- it a lot.'
tokens = t.tokenize(s)
print("Tokens:", tokens)
detokenized = d.detokenize(tokens, convert_parentheses=True)
print("Detokenized:", detokenized)

Output:

Tokens: ['They', 'like-LRB-d-RRB-', 'it', 'a', 'lot', '.']
Detokenized: They like(d) it a lot.
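The same -LRB-/-RRB- convention carries over to the bracketed tree string from the question: escape the token before it goes into the string, and optionally map it back when the leaves are read. A minimal sketch, where ptb_escape and ptb_unescape are hypothetical helper names and not part of NLTK:

```python
from nltk import Tree


def ptb_escape(token):
    """Hypothetical helper: replace parentheses with PTB-style escapes."""
    return token.replace("(", "-LRB-").replace(")", "-RRB-")


def ptb_unescape(leaf):
    """Hypothetical helper: turn PTB-style escapes back into parentheses."""
    return leaf.replace("-LRB-", "(").replace("-RRB-", ")")


verb = ptb_escape("like(d)")
s = f"(S (NP (PRP They)) (VP {verb} (NP (PRP it)) (NP (DT a) (NN lot))) (. .))"

escaped = Tree.fromstring(s)                           # leaves keep -LRB-/-RRB-
restored = Tree.fromstring(s, read_leaf=ptb_unescape)  # leaves get ( and ) back

print(escaped.leaves())
print(restored.leaves())
```

Keeping the escaped form in the tree is usually the safer choice if the tree will be written out and parsed again, because printing the restored tree puts the raw parentheses back into the string.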
