Out of memory when parsing trees with NLTK

Question

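I am running code that uses the NLTK package. It works for some input sequences, but for some long sequences it runs out of memory. I also tried running it on a supercomputer, and it still reports "out of memory". The code is below, and it works for this input: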

import nltk
import numpy as np

# The grammar
grammar = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""
cfg = nltk.CFG.fromstring(grammar)

# Make a chart parser
parser = nltk.ChartParser(cfg)

# Map each production to its index in the grammar
prod_map = {}
for ix, prod in enumerate(cfg.productions()):
    prod_map[prod] = ix

# MAX_LEN and NCHARS are not defined in the original snippet; the values
# below are assumptions added so that the code runs.
MAX_LEN = 100               # assumed maximum number of productions per tree
NCHARS = len(prod_map) + 1  # assumed: one column per production plus a last padding column

# The test sentence
sent = [['C', 'C', 'C', 'C', 'A', 'A', 'A',
         'U', 'A', 'C', 'A', 'G', 'A', 'A',
         'G', 'C', 'G', 'G', 'G', 'C', 'U',
         'U', 'A'
        ]]

# Take the first parse tree of each sentence
parse_trees = [next(parser.parse(t)) for t in sent]

productions_seq = [tree.productions() for tree in parse_trees]

indices = [np.array([prod_map[prod] for prod in entry], dtype=int) for entry in productions_seq]

# One-hot encode the production sequence; unused rows are padded in the last column
one_hot = np.zeros((len(indices), MAX_LEN, NCHARS), dtype=np.float32)
for i in range(len(indices)):
    num_productions = len(indices[i])
    one_hot[i][np.arange(num_productions), indices[i]] = 1.
    one_hot[i][np.arange(num_productions, MAX_LEN), -1] = 1.

But for the input below, it runs out of memory:

sent= [['G', 'A', 'G', 'G', 'A', 'A', 'A', 'G', 'U', 'C', 'C', 'C', 'G', 
           'C', 'C', 'U', 'C', 'C', 'A', 'G', 'A', 'U', 'C', 'A', 'A', 'G', 
           'G', 'G', 'A', 'A', 'G', 'U', 'C', 'C', 'C', 'G', 'C', 'G', 'A'], 
          ['G', 'G', 'G', 'A', 'C', 'A', 'A', 'G', 'G', 'G', 'U', 'A', 'G', 
           'U', 'A', 'C', 'C', 'C', 'U', 'U', 'G', 'G', 'C', 'A', 'A', 'C', 
           'U', 'G', 'C', 'A', 'C', 'A', 'G', 'A', 'A', 'A', 'A', 'C', 'U', 'U']]

Does anyone know what causes this and how to fix it?

Answer 1

Using nltk.ViterbiParser instead of nltk.ChartParser solved the problem.
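For reference, here is a minimal sketch of what that swap might look like. Note that nltk.ViterbiParser expects a probabilistic grammar (a PCFG), so the sketch gives every alternative of a nonterminal the same probability. Those uniform probabilities are an assumption made for illustration, not part of the original grammar, and they influence which single parse is returned. The helper names to_uniform_pcfg and best_production_indices are hypothetical as well.

```python
from collections import Counter

import nltk
from nltk.grammar import PCFG, ProbabilisticProduction

GRAMMAR = """
S -> L S | L
L -> 'A' F 'U' | 'A' | 'U' F 'A' | 'U' | 'C' F 'G' | 'C' | 'G' F 'C' | 'G'
F -> 'A' F 'U' | 'U' F 'A' | 'C' F 'G' | 'G' F 'C' | L S
"""

cfg = nltk.CFG.fromstring(GRAMMAR)


def to_uniform_pcfg(cfg):
    """Hypothetical helper: give each alternative of a nonterminal equal probability."""
    counts = Counter(p.lhs() for p in cfg.productions())
    weighted = [
        ProbabilisticProduction(p.lhs(), p.rhs(), prob=1.0 / counts[p.lhs()])
        for p in cfg.productions()
    ]
    return PCFG(cfg.start(), weighted)


# Index productions by (lhs, rhs) so the lookup does not depend on
# whether a production object is probabilistic or not.
prod_index = {(p.lhs(), p.rhs()): ix for ix, p in enumerate(cfg.productions())}

parser = nltk.ViterbiParser(to_uniform_pcfg(cfg))


def best_production_indices(tokens):
    """Hypothetical helper: production indices of the single most likely parse."""
    tree = next(parser.parse(tokens), None)
    if tree is None:
        raise ValueError("no parse found for the given tokens")
    return tree, [prod_index[(p.lhs(), p.rhs())] for p in tree.productions()]


tokens = list("CCCCAAAUACAGAAGCGGGCUUA")  # the short sequence from the question
tree, indices = best_production_indices(tokens)
print(len(indices), "productions in the most likely parse")
```

The resulting indices can be fed into the one-hot encoding step from the question. The Viterbi parser keeps only the most likely constituent for each span and nonterminal, which is presumably why it avoids the memory blow-up on the longer sequences.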

Solution 1
0 2021-06-04 04:54:08
