
Real-time text processing using Python

I am doing real-time text processing in Python. For example, consider this sentence:

I am going to schol today

I want to do the following (real time):

1) tokenize 
2) check spellings
3) stem(nltk.PorterStemmer()) 
4) lemmatize (nltk.WordNetLemmatizer())

Currently I am using the NLTK library for these operations, but it is not real time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this more efficient?
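A minimal sketch of the pipeline described above, assuming NLTK is installed. One thing that matters for per-sentence latency is constructing the stemmer and lemmatizer once at startup rather than once per sentence; the tokenizer and spell checker here are placeholders, since NLTK has no built-in spell checker:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Build the tools once at startup, not once per sentence --
# repeated construction is a common source of per-call overhead.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def process(sentence):
    # Placeholder tokenizer: a plain whitespace split. nltk.word_tokenize
    # could be used instead (it requires the 'punkt' data to be downloaded).
    tokens = sentence.lower().split()
    # Spell checking is omitted here; NLTK has no built-in spell checker,
    # so a separate library would be needed for that step.
    stems = [stemmer.stem(t) for t in tokens]
    # Note: the first lemmatize() call loads the WordNet corpus from disk.
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas
```

This is only a sketch of the steps listed in the question, not a drop-in replacement for whatever the real pipeline does.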

Update: Profiling:

Fri Jul  8 17:59:32 2011    srj.profile

         105503 function calls (101919 primitive calls) in 1.743 CPU seconds

   Ordered by: internal time
   List reduced from 1797 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7450    0.136    0.000    0.208    0.000 sre_parse.py:182(__next)
  602/179    0.130    0.000    0.583    0.003 sre_parse.py:379(_parse)
23467/22658    0.122    0.000    0.130    0.000 {len}
 1158/142    0.092    0.000    0.313    0.002 sre_compile.py:32(_compile)
    16152    0.081    0.000    0.081    0.000 {method 'append' of 'list' objects}
     6365    0.070    0.000    0.249    0.000 sre_parse.py:201(get)
     4947    0.058    0.000    0.086    0.000 sre_parse.py:130(__getitem__)
 1641/639    0.039    0.000    0.055    0.000 sre_parse.py:140(getwidth)
      457    0.035    0.000    0.103    0.000 sre_compile.py:207(_optimize_charset)
     6512    0.034    0.000    0.034    0.000 {isinstance}

timeit:

t = timeit.Timer(main)
print t.timeit(1000)

=> 3.7256231308

NLTK's WordNetLemmatizer uses a lazily loaded WordNetCorpusReader (via a LazyCorpusLoader). The first call to lemmatize() may take significantly longer than later calls if it triggers the corpus loading.

You could make a dummy call to lemmatize() at application startup to trigger the loading up front.
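A sketch of that warm-up, assuming the WordNet data has already been downloaded (e.g. via nltk.download('wordnet')):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def warm_up():
    # Any word will do: the first lemmatize() call forces the
    # LazyCorpusLoader to read the WordNet corpus from disk, so
    # later calls on the hot path stay fast.
    lemmatizer.lemmatize("dogs")

# Call warm_up() once at application startup, before the first
# real sentence arrives.
```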

I know NLTK is slow, but I can hardly believe it's that slow. In any case, first stemming, then lemmatizing is a bad idea, since these operations serve the same purpose and feeding the output from a stemmer to a lemmatizer is bound to give worse results than just lemmatizing. So skip the stemmer for an increase in both performance and accuracy.
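To illustrate the point: a Porter stem is often not a dictionary word, so the lemmatizer has nothing to look up. A small sketch (the lemmatizer behavior described in the comments assumes the WordNet data is available):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter stem of 'flies' is 'fli', which is not an English word:
print(stemmer.stem("flies"))  # -> fli
# WordNet has no entry for 'fli', so a lemmatizer run on stemmer
# output would return it unchanged, whereas lemmatizing 'flies'
# directly can recover 'fly'. Hence: skip the stemmer and lemmatize
# the raw tokens.
```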

No way it's that slow. I bet what's happening is the loading of the tools and data needed for the stemming etc. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.
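A quick way to run that test, with process() stubbed out as a stand-in for the real pipeline: if the total time barely grows from 1 to 100 sentences, the cost is one-time setup (tool construction, corpus loading), not per-sentence work.

```python
import timeit

def process(sentence):
    # Stand-in for the real tokenize/spell-check/stem/lemmatize pipeline.
    return sentence.split()

sentence = "I am going to schol today"

for n in (1, 10, 100):
    # number=1: time a single pass over n sentences, including any
    # lazy initialization triggered on the first call.
    t = timeit.timeit(lambda: [process(sentence) for _ in range(n)],
                      number=1)
    print("%3d sentences: %.6f s" % (n, t))
```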

Alternatively, the Stanford parser can do the same things and might be a bit quicker, being Java-based (or try LingPipe), but NLTK is way more user-friendly.
