
Real time text processing using Python

Original post: 2011-07-08 08:07:25. Tags: python, performance, nlp, text-processing, nltk

Real-time text processing using Python. For example, consider this sentence:

I am going to schol today

I want to do the following (in real time):

1) tokenize 
2) check spellings
3) stem(nltk.PorterStemmer()) 
4) lemmatize (nltk.WordNetLemmatizer())

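I am currently using the NLTK library for these operations, but it is not real time (it takes a few seconds to complete them). I am processing one sentence at a time. Is it possible to make this efficient?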
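
Steps 1 and 2 can be sketched with the standard library alone. This is only an illustration, not the asker's NLTK code: TOKEN_RE, VOCABULARY and correct() are names invented for the example, the word list is a toy one, and difflib matching stands in for a real spell checker.

```python
import difflib
import re

# Compiled once at import time and reused for every sentence.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Toy vocabulary (assumption); a real checker would load a full word list.
VOCABULARY = ["i", "am", "going", "to", "school", "today"]
VOCAB_SET = set(VOCABULARY)


def tokenize(sentence):
    """Split a sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def correct(token):
    """Return the closest vocabulary word, or the token itself if none is close."""
    if not token.isalpha() or token.lower() in VOCAB_SET:
        return token
    matches = difflib.get_close_matches(token.lower(), VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


tokens = [correct(t) for t in tokenize("I am going to schol today")]
print(tokens)
```

Everything expensive (the compiled pattern, the word list) is built once at module level, so each sentence only pays for the lookups.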

Update: profiling:

Fri Jul  8 17:59:32 2011    srj.profile

         105503 function calls (101919 primitive calls) in 1.743 CPU seconds

   Ordered by: internal time
   List reduced from 1797 to 10 due to restriction 

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     7450    0.136    0.000    0.208    0.000 sre_parse.py:182(__next)
  602/179    0.130    0.000    0.583    0.003 sre_parse.py:379(_parse)
23467/22658    0.122    0.000    0.130    0.000 {len}
 1158/142    0.092    0.000    0.313    0.002 sre_compile.py:32(_compile)
    16152    0.081    0.000    0.081    0.000 {method 'append' of 'list' objects}
     6365    0.070    0.000    0.249    0.000 sre_parse.py:201(get)
     4947    0.058    0.000    0.086    0.000 sre_parse.py:130(__getitem__)
 1641/639    0.039    0.000    0.055    0.000 sre_parse.py:140(getwidth)
      457    0.035    0.000    0.103    0.000 sre_compile.py:207(_optimize_charset)
     6512    0.034    0.000    0.034    0.000 {isinstance}
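
A listing in this format (ordered by internal time, cut to 10 rows) can be produced with cProfile and pstats. The sketch below is an assumption about how such a report is generated, not the asker's script; main() here is a placeholder workload.

```python
import cProfile
import io
import pstats
import re


def main():
    """Placeholder workload (assumption): tokenize one sentence with a regex."""
    return re.findall(r"\w+", "I am going to schol today")


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Sort by internal time ("tottime") and print only the top 10 entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("tottime").print_stats(10)
print(report.getvalue())
```

In the listing above, most of the top entries are in sre_parse.py and sre_compile.py, the standard-library modules that parse and compile regular expression patterns. That suggests the profiled run spent much of its time compiling patterns, which is normally a one-time cost because compiled patterns are cached.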

Timing:

import timeit

t = timeit.Timer(main)   # main is the function being measured
print(t.timeit(1000))    # total seconds for 1000 calls

=> 3.7256231308
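
Note that Timer.timeit(number) returns the total time for all runs, not the time per run. A small sketch of the arithmetic, with a placeholder main() standing in for the asker's function:

```python
import timeit


def main():
    """Placeholder for the asker's pipeline (assumption)."""
    return "I am going to schol today".split()


runs = 1000
total_seconds = timeit.Timer(main).timeit(runs)   # total for all runs
print(f"measured here: {total_seconds / runs * 1000:.4f} ms per call")

# The same arithmetic applied to the figure reported above.
reported_total = 3.7256231308
reported_ms_per_call = reported_total / 1000 * 1000
print(f"reported: {reported_ms_per_call:.2f} ms per call")
```

By this reading, the reported figure works out to roughly 3.7 ms per call on average, and that average also absorbs any one-time loading done by the first call.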

3 Answers

NLTK's WordNetLemmatizer uses a lazily loaded WordNetCorpusReader (via LazyCorpusLoader). If the first call to lemmatize() triggers the corpus loading, it may take much longer than later calls.

You could make a dummy call to lemmatize() at application startup to trigger the loading.
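
The effect of such a warm-up call can be sketched without NLTK installed. LazyLemmatizer below is a hypothetical stand-in that loads its data on first use (the sleep simulates reading a corpus); it is not NLTK's class. With NLTK, the equivalent dummy call would be WordNetLemmatizer().lemmatize("dummy").

```python
import time


class LazyLemmatizer:
    """Hypothetical stand-in for a lemmatizer whose data loads on first use."""

    def __init__(self):
        self._table = None   # nothing loaded yet
        self.loads = 0       # how many times the data was loaded

    def _load(self):
        time.sleep(0.05)     # simulate reading a corpus from disk
        self.loads += 1
        return {"going": "go", "was": "be"}

    def lemmatize(self, word):
        if self._table is None:          # first call pays the loading cost
            self._table = self._load()
        return self._table.get(word, word)


def timed(func, *args):
    """Return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


lemmatizer = LazyLemmatizer()

# Application startup: one throwaway call triggers the load.
_, warmup_seconds = timed(lemmatizer.lemmatize, "dummy")

# Later per-sentence calls no longer include the loading time.
lemma, call_seconds = timed(lemmatizer.lemmatize, "going")
print(f"warm-up: {warmup_seconds:.4f}s, later call: {call_seconds:.6f}s -> {lemma}")
```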

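I know NLTK is slow, but I can hardly believe it is that slow. In any case, stemming first and then lemmatizing is a bad idea: the two operations serve the same purpose, and feeding the output of a stemmer to a lemmatizer is bound to give worse results than lemmatizing alone. So skip the stemmer to improve both performance and accuracy.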

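No way it is that slow. I bet what is happening is the loading of the tools and data needed for stemming and so on. As said earlier, run a few tests: 1 sentence, 10 sentences, 100 sentences.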
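
One way to run that comparison is sketched below. process() is a placeholder (assumption) with a simulated one-time setup cost; swap in the real NLTK pipeline to take actual measurements.

```python
import time

_RESOURCES = None  # loaded once, on first use


def process(sentence):
    """Placeholder pipeline: one-time setup plus cheap per-sentence work."""
    global _RESOURCES
    if _RESOURCES is None:
        time.sleep(0.05)                  # simulated one-time loading
        _RESOURCES = {"schol": "school"}
    return [_RESOURCES.get(tok, tok) for tok in sentence.split()]


def time_batch(func, sentences):
    """Return the total seconds spent calling func on every sentence."""
    start = time.perf_counter()
    for s in sentences:
        func(s)
    return time.perf_counter() - start


sentence = "I am going to schol today"
results = {}
for n in (1, 10, 100):
    results[n] = time_batch(process, [sentence] * n)
    print(f"{n:>3} sentences: {results[n]:.4f}s total, {results[n] / n:.6f}s each")
```

If the total barely grows from 1 to 100 sentences, or the first batch is the slowest, the time is going into loading rather than into per-sentence processing.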

Alternatively, the Stanford parser (or LingPipe) can do the same things and, being Java based, might be a bit quicker, but NLTK is much more user-friendly.


