Out of memory for NLTK tree parsing

I am running code that uses the NLTK package. It works for some input sequences, but for longer sequences it runs out of memory. I also tried a supercomputer, and it still reports "Out of memory". The code below works for this input:

import nltk
import numpy as np
# The grammar
grammar = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""
# Parse the grammar once and build a chart parser from it
cfg = nltk.CFG.fromstring(grammar)
parser = nltk.ChartParser(cfg)

# Map each production to an integer index
prod_map = {prod: ix for ix, prod in enumerate(cfg.productions())}

# The test sentence
sent = [['C', 'C', 'C', 'C', 'A', 'A', 'A',
        'U', 'A', 'C', 'A', 'G', 'A', 'A',
        'G', 'C', 'G', 'G', 'G', 'C', 'U',
        'U', 'A'
       ]]


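# Take the first parse tree of each sentence
parse_trees = [next(parser.parse(t)) for t in sent]

# Sequence of productions used in each tree
productions_seq = [tree.productions() for tree in parse_trees]

# Integer index of every production in each sequence
indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq]

# MAX_LEN and NCHARS are not defined in the original post; the values below are assumptions.
MAX_LEN = 200               # maximum number of productions per tree (placeholder)
NCHARS = len(prod_map) + 1  # one column per production plus a final padding column

# One-hot encode each production sequence and pad the remaining rows with the last column
one_hot = np.zeros((len(indices), MAX_LEN, NCHARS), dtype=np.float32)
for i in range(len(indices)):
    num_productions = len(indices[i])
    one_hot[i][np.arange(num_productions), indices[i]] = 1.
    one_hot[i][np.arange(num_productions, MAX_LEN), -1] = 1.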

But for the following input, it runs out of memory:

sent= [['G', 'A', 'G', 'G', 'A', 'A', 'A', 'G', 'U', 'C', 'C', 'C', 'G', 
           'C', 'C', 'U', 'C', 'C', 'A', 'G', 'A', 'U', 'C', 'A', 'A', 'G', 
           'G', 'G', 'A', 'A', 'G', 'U', 'C', 'C', 'C', 'G', 'C', 'G', 'A'], 
          ['G', 'G', 'G', 'A', 'C', 'A', 'A', 'G', 'G', 'G', 'U', 'A', 'G', 
           'U', 'A', 'C', 'C', 'C', 'U', 'U', 'G', 'G', 'C', 'A', 'A', 'C', 
           'U', 'G', 'C', 'A', 'C', 'A', 'G', 'A', 'A', 'A', 'A', 'C', 'U', 'U']]

Does anyone know what the reason is and how to solve it?

Using nltk.ViterbiParser instead of nltk.ChartParser solved the problem.
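
A likely reason for the memory blow-up is that the grammar is highly ambiguous, so the number of parse trees grows very quickly with sequence length, whereas a Viterbi-style parser keeps only the single best subtree per span and symbol. Note that nltk.ViterbiParser expects a PCFG rather than a plain CFG. The sketch below is one possible way to make the switch, not necessarily what was done originally: it assumes every alternative of a nonterminal is equally likely, and the names n_alternatives, weighted and pcfg are illustrative only.

```python
from collections import Counter

import nltk
import numpy as np
from nltk.grammar import PCFG, ProbabilisticProduction

grammar = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""
cfg = nltk.CFG.fromstring(grammar)
prod_map = {prod: ix for ix, prod in enumerate(cfg.productions())}

# Assumption: uniform probabilities, i.e. each alternative of a nonterminal
# gets 1 / (number of alternatives for that nonterminal).
n_alternatives = Counter(prod.lhs() for prod in cfg.productions())
weighted = [
    ProbabilisticProduction(
        prod.lhs(), prod.rhs(), prob=1.0 / n_alternatives[prod.lhs()]
    )
    for prod in cfg.productions()
]
pcfg = PCFG(cfg.start(), weighted)
parser = nltk.ViterbiParser(pcfg)

# First long sequence from the question (39 tokens)
tokens = list(
    "GAGGAAAGUCCCG"
    "CCUCCAGAUCAAG"
    "GGAAGUCCCGCGA"
)

# ViterbiParser yields at most one tree: the most probable parse
tree = next(parser.parse(tokens))

# The productions of the tree can be indexed with the same prod_map as before
indices = np.array([prod_map[prod] for prod in tree.productions()], dtype=int)
print(len(tokens), "tokens ->", len(indices), "productions")
```

The tree returned here is the most probable parse under the assumed uniform weights, so it may differ from the first tree that ChartParser would have produced for the same sequence. If a particular structure is preferred, the probabilities would need to be chosen accordingly.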


 