Real-time text processing using Python. For example, consider this sentence:
I am going to schol today
I want to do the following (real time):
1) tokenize
2) check spelling
3) stem (nltk.PorterStemmer())
4) lemmatize (nltk.WordNetLemmatizer())
Currently I am using the NLTK library for these operations, but it is not real time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this more efficient?
Update: Profiling:
Fri Jul 8 17:59:32 2011    srj.profile

         105503 function calls (101919 primitive calls) in 1.743 CPU seconds

   Ordered by: internal time
   List reduced from 1797 to 10 due to restriction

   ncalls      tottime  percall  cumtime  percall  filename:lineno(function)
   7450        0.136    0.000    0.208    0.000    sre_parse.py:182(__next)
   602/179     0.130    0.000    0.583    0.003    sre_parse.py:379(_parse)
   23467/22658 0.122    0.000    0.130    0.000    {len}
   1158/142    0.092    0.000    0.313    0.002    sre_compile.py:32(_compile)
   16152       0.081    0.000    0.081    0.000    {method 'append' of 'list' objects}
   6365        0.070    0.000    0.249    0.000    sre_parse.py:201(get)
   4947        0.058    0.000    0.086    0.000    sre_parse.py:130(__getitem__)
   1641/639    0.039    0.000    0.055    0.000    sre_parse.py:140(getwidth)
   457         0.035    0.000    0.103    0.000    sre_compile.py:207(_optimize_charset)
   6512        0.034    0.000    0.034    0.000    {isinstance}
timeit:

t = timeit.Timer(main)
print t.timeit(1000)
# => 3.7256231308
NLTK's WordNetLemmatizer uses a lazily-loaded WordNetCorpusReader (via a LazyCorpusLoader). The first call to lemmatize() may take significantly longer than later calls if it triggers the corpus loading. You could place a dummy call to lemmatize() at application startup to trigger the loading up front.
I know NLTK is slow, but I can hardly believe it's that slow. In any case, first stemming, then lemmatizing is a bad idea, since these operations serve the same purpose and feeding the output from a stemmer to a lemmatizer is bound to give worse results than just lemmatizing. So skip the stemmer for an increase in both performance and accuracy.
No way it's that slow. I bet what's happening is the one-time loading of the tools and data needed for stemming etc. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.
Alternatively, the Stanford parser can do the same work and might be a bit quicker, being Java-based (or try LingPipe), but NLTK is waaaaaay more user friendly.