
Parsing a single tree for a sequence using NLTK in Python

I want to build a parse tree for an RNA sequence. I tokenized the sequence into a list, as shown in the code below, and parsed it:

from __future__ import print_function
import nltk
import pdb
import numpy as np
import h5py
import RNA_vae
import equation_vae_copy
import RNA_grammar

sent = ['C', 'C', 'C', 'C', 'A', 'A', 'A', 'U', 'A', 'C', 'A', 'G', 'A', 'A', 'G', 'C', 'G', 'G', 'G', 'C', 'U', 'U', 'A']
parser = nltk.ChartParser(RNA_grammar.GCFG) 
parse_trees = [next(parser.parse(t)) for t in sent]

print(parse_trees)

However, the code produces the following output:

[Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['U'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['G'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['A'])]), Tree('S', [Tree('L', ['G'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['G'])]), Tree('S', [Tree('L', ['G'])]), Tree('S', [Tree('L', ['G'])]), Tree('S', [Tree('L', ['C'])]), Tree('S', [Tree('L', ['U'])]), Tree('S', [Tree('L', ['U'])]), Tree('S', [Tree('L', ['A'])])]

I want a single tree for the whole sequence, but the code builds a separate tree for each character of the RNA. How can I generate one tree for the entire sequence?

The grammar is as below:

# the RNA grammar
gram = """S -> LS
S -> L
LS -> L
LS -> S
L -> AFU
L -> UFA
L -> GFC
L -> CFG
L -> 'A'
L -> 'U'
L -> 'C'
L -> 'G'
F -> AFU
F -> UFA
F -> GFC
F -> CFG
F -> LS
AFU -> 'A'
AFU -> F
AFU -> 'U'
UFA -> 'U'
UFA -> F
UFA -> 'A'
GFC -> 'G'
GFC -> F
GFC -> 'C'
CFG -> 'C'
CFG -> F
CFG -> 'G'
Nothing -> Nones
"""

The grammar should be as shown below:

[Image: RNA grammar]

Then, I changed the grammar as follows, but it still fails to parse a sequence:

gram = """S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
Nothing -> Nones
"""

As discussed in the comments, you started with two fundamental problems:

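  1. The grammar you wrote could only handle a single character. NLTK reads names such as LS and AFU as single nonterminals, so every production had exactly one symbol on its right-hand side.

  2. You called your parser with one character at a time.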

The result was a list of "parses", one for each character separately.
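The first problem can be checked directly. The following is a minimal sketch, assuming the question's original grammar string is loaded with nltk.CFG.fromstring (the question does not show how RNA_grammar.GCFG is built, so that is an assumption); the unused "Nothing -> Nones" line is left out, and the names original_grammar, rhs_lengths and two_token_parses are only illustrative:

```python
import nltk

# The original grammar from the question, without the "Nothing -> Nones" line.
original_grammar = nltk.CFG.fromstring("""
S -> LS
S -> L
LS -> L
LS -> S
L -> AFU
L -> UFA
L -> GFC
L -> CFG
L -> 'A'
L -> 'U'
L -> 'C'
L -> 'G'
F -> AFU
F -> UFA
F -> GFC
F -> CFG
F -> LS
AFU -> 'A'
AFU -> F
AFU -> 'U'
UFA -> 'U'
UFA -> F
UFA -> 'A'
GFC -> 'G'
GFC -> F
GFC -> 'C'
CFG -> 'C'
CFG -> F
CFG -> 'G'
""")

# Collect the right-hand-side length of every production.
# A grammar whose productions all have one symbol on the right
# can only derive strings that are exactly one token long.
rhs_lengths = {len(p.rhs()) for p in original_grammar.productions()}
print(rhs_lengths)

# Try to parse a two-token input with this grammar and count the trees.
parser = nltk.ChartParser(original_grammar)
two_token_parses = list(parser.parse(['C', 'C']))
print(len(two_token_parses))
```

Writing "L S" with a space, as in the edited grammar, is what turns the right-hand side into two separate symbols.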

After fixing your grammar, as indicated in the edited question, pass the entire sequence to parser.parse in a single call. That produces 2100 possible parses.

Here's what I did (you can do it too by copying the following code block into your Python console):

# import only what's needed
import nltk
# The grammar
grammar = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""
# Make a chartparser
parser = nltk.ChartParser(nltk.CFG.fromstring(grammar))
# The test sentence
sent = ['C', 'C', 'C', 'C', 'A', 'A', 'A',
        'U', 'A', 'C', 'A', 'G', 'A', 'A',
        'G', 'C', 'G', 'G', 'G', 'C', 'U',
        'U', 'A'
       ]
# Get all of the parses
parses = list(parser.parse(sent))
# There are a lot of them. len(parses) is 2100.
# Print one of them to the console
parses[0].pprint()

That prints:

(S
  (L C)
  (S
    (L C)
    (S
      (L C)
      (S
        (L C)
        (S
          (L A)
          (S
            (L A)
            (S
              (L A)
              (S
                (L
                  U
                  (F
                    (L A)
                    (S
                      (L C)
                      (S
                        (L A)
                        (S
                          (L G)
                          (S
                            (L
                              A
                              (F
                                A
                                (F G (F C (F (L G) (S (L G))) G) C)
                                U)
                              U))))))
                  A)))))))))
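If only one tree is wanted, as the question asks, the first result can be taken from the iterator that parser.parse returns instead of building the full list. This is a minimal sketch using the same grammar; which of the ambiguous parses comes first is decided by NLTK, since the grammar does not rank them, and the name first_tree is only illustrative:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
""")
parser = nltk.ChartParser(grammar)

# The same 23-token sequence as above, one token per nucleotide.
sent = list("CCCCAAAUACAGAAGCGGGCUUA")

# parser.parse returns an iterator over trees; next() takes the first one.
# The default of None covers inputs that the grammar cannot parse.
first_tree = next(parser.parse(sent), None)

if first_tree is None:
    print("no parse found")
else:
    first_tree.pprint()
```

Note that the whole list must be passed to parser.parse; iterating over sent and parsing each element reproduces the one-tree-per-character result from the question.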
