
Real-time text processing using Python

I am doing real-time text processing in Python. For example, consider this sentence:

I am going to schol today

I want to do the following (real time):

1) tokenize 
2) check spellings
3) stem(nltk.PorterStemmer()) 
4) lemmatize (nltk.WordNetLemmatizer())

Currently I am using the NLTK library for these operations, but it is not real time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this more efficient?
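A minimal sketch of the pipeline described above, assuming NLTK is installed. One thing that matters for per-sentence latency is constructing the stemmer and lemmatizer once at startup rather than once per sentence; the tokenizer and spell checker here are placeholders, since NLTK has no built-in spell checker:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Build the tools once at startup, not once per sentence --
# repeated construction is a common source of per-call overhead.
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def process(sentence):
    # Placeholder tokenizer: a plain whitespace split. nltk.word_tokenize
    # could be used instead (it requires the 'punkt' data to be downloaded).
    tokens = sentence.lower().split()
    # Spell checking is omitted here; NLTK has no built-in spell checker,
    # so a separate library would be needed for that step.
    stems = [stemmer.stem(t) for t in tokens]
    # Note: the first lemmatize() call loads the WordNet corpus from disk.
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return stems, lemmas
```

This is only a sketch of the steps listed in the question, not a drop-in replacement for whatever the real pipeline does.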

Update: Profiling:

Fri Jul  8 17:59:32 2011    srj.profile

         105503 function calls (101919 primitive calls) in 1.743 CPU seconds

   Ordered by: internal time
   List reduced from 1797 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7450    0.136    0.000    0.208    0.000 sre_parse.py:182(__next)
  602/179    0.130    0.000    0.583    0.003 sre_parse.py:379(_parse)
23467/22658    0.122    0.000    0.130    0.000 {len}
 1158/142    0.092    0.000    0.313    0.002 sre_compile.py:32(_compile)
    16152    0.081    0.000    0.081    0.000 {method 'append' of 'list' objects}
     6365    0.070    0.000    0.249    0.000 sre_parse.py:201(get)
     4947    0.058    0.000    0.086    0.000 sre_parse.py:130(__getitem__)
 1641/639    0.039    0.000    0.055    0.000 sre_parse.py:140(getwidth)
      457    0.035    0.000    0.103    0.000 sre_compile.py:207(_optimize_charset)
     6512    0.034    0.000    0.034    0.000 {isinstance}

timeit:

t = timeit.Timer(main)
print t.timeit(1000)

=> 3.7256231308

NLTK's WordNetLemmatizer uses a lazily loaded WordNetCorpusReader (via a LazyCorpusLoader). The first call to lemmatize() may take significantly longer than later calls if it triggers the corpus loading.

You could make a dummy call to lemmatize() at application startup to trigger the loading up front.
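A sketch of that warm-up, assuming the WordNet data has already been downloaded (e.g. via nltk.download('wordnet')):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def warm_up():
    # Any word will do: the first lemmatize() call forces the
    # LazyCorpusLoader to read the WordNet corpus from disk, so
    # later calls on the hot path stay fast.
    lemmatizer.lemmatize("dogs")

# Call warm_up() once at application startup, before the first
# real sentence arrives.
```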

I know NLTK is slow, but I can hardly believe it's that slow. In any case, first stemming, then lemmatizing is a bad idea, since these operations serve the same purpose and feeding the output from a stemmer to a lemmatizer is bound to give worse results than just lemmatizing. So skip the stemmer for an increase in both performance and accuracy.
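To illustrate the point: a Porter stem is often not a dictionary word, so the lemmatizer has nothing to look up. A small sketch (the lemmatizer behavior described in the comments assumes the WordNet data is available):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter stem of 'flies' is 'fli', which is not an English word:
print(stemmer.stem("flies"))  # -> fli
# WordNet has no entry for 'fli', so a lemmatizer run on stemmer
# output would return it unchanged, whereas lemmatizing 'flies'
# directly can recover 'fly'. Hence: skip the stemmer and lemmatize
# the raw tokens.
```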

No way it's that slow. I bet what's happening is the loading of the tools and data needed for the stemming etc. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.
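A quick way to run that test, with process() stubbed out as a stand-in for the real pipeline: if the total time barely grows from 1 to 100 sentences, the cost is one-time setup (tool construction, corpus loading), not per-sentence work.

```python
import timeit

def process(sentence):
    # Stand-in for the real tokenize/spell-check/stem/lemmatize pipeline.
    return sentence.split()

sentence = "I am going to schol today"

for n in (1, 10, 100):
    # number=1: time a single pass over n sentences, including any
    # lazy initialization triggered on the first call.
    t = timeit.timeit(lambda: [process(sentence) for _ in range(n)],
                      number=1)
    print("%3d sentences: %.6f s" % (n, t))
```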

Alternatively, the Stanford parser can do the same things and might be a bit quicker, being Java-based (or try LingPipe), but NLTK is way more user-friendly.
