
Quick NLTK parse into syntax tree

I am trying to parse several hundred sentences into their syntax trees, and I need to do it fast. The problem is that if I use NLTK I need to define a grammar, and I can't know the grammar in advance; all I know is that the input will be English. I tried using this statistical parser, and it works great for my purposes, but the speed could be a lot better. Is there a way to use NLTK parsing without a grammar? In this snippet I am using a processing pool to do the processing in "parallel", but the speed leaves a lot to be desired.

import pickle
import re
from stat_parser.parser import Parser
from multiprocessing import Pool
import HTMLParser

def multy(a):
    # parser is inherited from the parent process via fork
    global parser
    # split the text into sentences ending in . ! or ?
    lst = re.findall(r'(\S.+?[.!?])(?=\s+|$)', a[1])
    if len(lst) == 0:
        lst.append(a[1])
    try:
        ssd = parser.norm_parse(lst[0])
    except Exception:
        ssd = ['NNP', 'nothing']
    # append the pickled result between marker strings
    with open('/var/www/html/internal', 'a') as f:
        f.write("[[ss")
        pickle.dump([a[0], ssd], f)
        f.write("ss]]")

if __name__ == '__main__':
    parser = Parser()
    with open('/var/www/html/interface') as f:
        data = f.read()
    data = data.split("\n")
    p = Pool(len(data))  # one worker process per input line
    listed = list()
    h = HTMLParser.HTMLParser()
    # truncate the output file before the workers start appending
    with open('/var/www/html/internal', 'w') as f:
        f.write("")
    for ind, each in enumerate(data):
        # strip non-ASCII characters and unescape HTML entities
        listed.append([str(ind), h.unescape(re.sub(r'[^\x00-\x7F]+', '', each))])
    p.map(multy, listed)

Parsing is a fairly computationally intensive operation. You can probably get much better performance out of a more polished parser, such as BLLIP. It is written in C++ and has benefited from a team working on it over a prolonged period, and there is a Python module (bllipparser) that interfaces with it.

Here's an example comparing BLLIP and the parser you are using:

# setup stat_parser
from stat_parser import Parser
parser = Parser()

# setup bllip
from bllipparser import RerankingParser
from bllipparser.ModelFetcher import download_and_install_model
# download model (only needs to be done once)
model_dir = download_and_install_model('WSJ', '/tmp/models')
# Loading the model is slow, but only needs to be done once
rrp = RerankingParser.from_unified_model_dir(model_dir)

sentence = "In linguistics, grammar is the set of structural rules governing the composition of clauses, phrases, and words in any given natural language."

if __name__ == '__main__':
    from timeit import Timer
    t_bllip = Timer(lambda: rrp.parse(sentence))
    t_stat = Timer(lambda: parser.parse(sentence))
    print "bllip", t_bllip.timeit(number=5)
    print "stat", t_stat.timeit(number=5)

And it runs about 10 times faster on my computer:

(vs)[jonathan@ ~]$ python /tmp/test.py 
bllip 2.57274985313
stat 22.748554945
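
If you keep the multiprocessing approach from your snippet, the main cost to avoid is loading a parser for every task. Below is a minimal sketch (assuming a fork-capable platform; the worker function names and the model path are illustrative, not part of the bllipparser API) that loads the model once per worker via a Pool initializer:

from multiprocessing import Pool
from bllipparser import RerankingParser

rrp = None  # one parser instance per worker process

def init_worker(model_dir):
    # Runs once in each worker: load the slow-to-load model a fixed
    # number of times instead of once per sentence.
    global rrp
    rrp = RerankingParser.from_unified_model_dir(model_dir)

def parse_one(sentence):
    try:
        # take the best-scoring parse from the n-best list
        return str(rrp.parse(sentence)[0].ptb_parse)
    except Exception:
        return None

if __name__ == '__main__':
    model_dir = '/tmp/models/WSJ'  # wherever download_and_install_model put it
    sentences = ["The cat sat on the mat.",
                 "Colorless green ideas sleep furiously."]
    pool = Pool(4, initializer=init_worker, initargs=(model_dir,))
    print pool.map(parse_one, sentences)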

Also, there's a pull request pending to integrate the BLLIP parser into NLTK: https://github.com/nltk/nltk/pull/605
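
Once that PR is merged, the parser should be usable through NLTK's own interface. Based on the code in the PR, usage would look roughly like this (the module path and method names may change before it lands):

from bllipparser.ModelFetcher import download_and_install_model
from nltk.parse.bllip import BllipParser  # available once the PR is merged

model_dir = download_and_install_model('WSJ', '/tmp/models')
bllip = BllipParser.from_unified_model_dir(model_dir)

# parse() takes a pre-tokenized sentence and yields nltk.Tree objects
sentence = "The quick brown fox jumps over the lazy dog.".split()
for tree in bllip.parse(sentence):
    print tree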

Also, you state in your question that you can't know the grammar in advance and only know the input will be English. If by this you mean it needs to parse other languages as well, it will be much more complicated. These statistical parsers are trained on some input, often parsed content from the WSJ portion of the Penn Treebank. Some parsers provide trained models for other languages as well, but you'll need to identify the language first, and load an appropriate model into the parser.
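
As a sketch of that workflow, you could run a language identifier first and keep one loaded parser per detected language. Here langdetect is just one example of an identifier (pip install langdetect), and the non-English model entries are placeholders, since what models exist depends on what you can find or train:

from langdetect import detect  # any language identifier would work here
from bllipparser import RerankingParser

# Placeholder mapping from language code to unified model directory.
# 'WSJ' is a real BLLIP model; other entries are hypothetical.
MODELS = {
    'en': '/tmp/models/WSJ',
    # 'fr': '/path/to/some/french/model',
}

parsers = {}  # cache: one loaded parser per language

def parse_any(sentence):
    lang = detect(sentence)  # e.g. 'en', 'fr', ...
    if lang not in MODELS:
        raise ValueError("no model for language %r" % lang)
    if lang not in parsers:
        parsers[lang] = RerankingParser.from_unified_model_dir(MODELS[lang])
    return parsers[lang].parse(sentence)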
