
Stanford NER and POS, multithreading for a large dataset

I am trying to use Stanford NER and the Stanford POS Tagger to parse about 23,000 documents. I have implemented it using the following pseudocode:

```
for each in documents:
    sentences = PunktTokenize(each)
    # code to generate NER tags
    # code to generate POS tags on the above output
```

For a 4-core machine with 15 GB RAM, the run time for NER alone is approximately 945 hours. I have tried to speed things up using the `threading` library, but I get the following error:

```
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "removeStopWords.py", line 75, in partofspeechRecognition
    listOfRes_new = namedEntityRecognition(listRes[min:max])
  File "removeStopWords.py", line 63, in namedEntityRecognition
    listRes_ner.append(namedEntityRecognitionResume(eachResSentence))
  File "removeStopWords.py", line 50, in namedEntityRecognitionResume
    ner2Tags = ner2.tag(each.title().split())
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 71, in tag
    return sum(self.tag_sents([tokens]), [])
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 98, in tag_sents
    os.unlink(self._input_file_path)
OSError: [Errno 2] No such file or directory: '/tmp/tmpvMNqwB'
```

I am using NLTK version 3.2.1 and the Stanford NER/POS 3.7.0 jar files, along with the `threading` module. As far as I can tell, this might be due to the threads contending over the same temporary file in /tmp. Please correct me if I am wrong; also, what is the best way to run the above using threads, or is there a better way to implement it?
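If the crash really is two threads racing on the tagger's shared temp file, one workaround is to give every worker its own tagger instance and split the documents into chunks. A minimal sketch of that pattern (Python 3 style; `make_tagger` is a hypothetical stand-in for constructing `nltk.tag.StanfordNERTagger`, not real NLTK code):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for building an nltk.tag.StanfordNERTagger; in the
# real script each worker would construct its own tagger, so no two threads
# ever share the same /tmp input file.
def make_tagger():
    return lambda tokens: [(tok, "O") for tok in tokens]

def tag_chunk(docs):
    tagger = make_tagger()  # one tagger per worker chunk
    return [tagger(doc.split()) for doc in docs]

def chunked(seq, n):
    """Split seq into chunks of roughly len(seq)/n items."""
    k = max(1, len(seq) // n)
    return [seq[i:i + k] for i in range(0, len(seq), k)]

documents = ["John works at Stanford", "Jane lives in London"]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [doc for part in pool.map(tag_chunk, chunked(documents, 4))
               for doc in part]
```

Threads are enough here because each tagger call spends its time waiting on a subprocess, which releases the GIL; the key point is only that the tagger instance is not shared.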

I am using the 3-class classifier for NER and the Maxent POS Tagger.

PS - Please ignore the name of the Python file; I still haven't removed the stopwords or punctuation from the original text.

Edit - Using cProfile and sorting by cumulative time, I got the following top 20 calls:

```
600792 function calls (595912 primitive calls) in 60.795 seconds

Ordered by: cumulative time
List reduced from 3357 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000   60.811   60.811 removeStopWords.py:1(<module>)
    1    0.000    0.000   58.923   58.923 removeStopWords.py:76(partofspeechRecognition)
   28    0.001    0.000   58.883    2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:69(tag)
   28    0.004    0.000   58.883    2.103 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:73(tag_sents)
   28    0.001    0.000   56.927    2.033 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:63(java)
  141    0.001    0.000   56.532    0.401 /usr/lib/python2.7/subprocess.py:769(communicate)
  140    0.002    0.000   56.530    0.404 /usr/lib/python2.7/subprocess.py:1408(_communicate)
  140    0.008    0.000   56.492    0.404 /usr/lib/python2.7/subprocess.py:1441(_communicate_with_poll)
  400   56.474    0.141   56.474    0.141 {built-in method poll}
    1    0.001    0.001   43.522   43.522 removeStopWords.py:69(partofspeechRecognitionRes)
    1    0.000    0.000   15.401   15.401 removeStopWords.py:62(namedEntityRecognition)
    1    0.001    0.001   15.367   15.367 removeStopWords.py:46(namedEntityRecognitionRes)
  141    0.004    0.000    2.302    0.016 /usr/lib/python2.7/subprocess.py:651(__init__)
  141    0.020    0.000    2.287    0.016 /usr/lib/python2.7/subprocess.py:1199(_execute_child)
   56    0.002    0.000    1.933    0.035 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:38(config_java)
   56    0.001    0.000    1.931    0.034 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:599(find_binary)
  112    0.002    0.000    1.930    0.017 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:582(find_binary_iter)
  118    0.009    0.000    1.928    0.016 /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:453(find_file_iter)
    1    0.001    0.001    1.318    1.318 /usr/lib/python2.7/pickle.py:1383(load)
    1    0.046    0.046    1.317    1.317 /usr/lib/python2.7/pickle.py:851(load)
```

It seems the Python wrapper is the culprit here; the Java implementation is not taking as much time. It takes approximately what @Gabor Angeli mentioned. Try it.
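The profile supports this: 28 calls into `tag_sents` account for roughly 57 of the 60 seconds, and almost all of that is spent in `poll` waiting on the `java` subprocess, because NLTK spawns a fresh JVM (and reloads the model) on every call. Batching as many sentences as possible into a single `tag_sents` call amortizes that startup cost. A toy sketch of why batching matters (a counter stands in for the JVM launch; this is not NLTK's actual implementation):

```python
launches = 0

def tag_sents(sentences):
    """Stand-in for the NLTK tagger: one simulated 'JVM launch' per call."""
    global launches
    launches += 1
    return [[(tok, "O") for tok in sent] for sent in sentences]

sentences = [["John", "Smith"], ["Jane", "Doe"], ["Acme", "Corp"]]

# Slow pattern: one subprocess per sentence (what calling .tag() in a loop does).
for sent in sentences:
    tag_sents([sent])
assert launches == 3

# Fast pattern: one subprocess for the whole batch.
launches = 0
tagged = tag_sents(sentences)
assert launches == 1
```

With real JVM startup costing a second or two per call, collapsing thousands of calls into a handful of batched ones is where most of the 945 hours goes away.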

Hope it helps!

Maybe this is solved already, but for those still trying to speed up Stanford NLP in Python, here is a tried and tested answer: How to speedup Stanford NLP in Python?

Basically, it suggests running the NER server in the background and calling it through the sner library for all further Stanford NLP related tasks.

Found the answer.

Start the Stanford NLP server in the background, from the folder where Stanford NLP is unzipped. A portion of that answer is given below:

```shell
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
```

Then create the tagger in Python using the sner library:

```python
from sner import Ner
tagger = Ner(host='localhost', port=9199)
```

Then run the tagger:

```python
%%time
classified_text = tagger.get_entities(text)
print(classified_text)
```

Output:

```
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
```
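Because the server keeps the model loaded in one long-lived JVM, each `get_entities` call is just a socket round-trip, so a large document loop can safely be parallelized with plain threads. A sketch of that pattern (Python 3; `FakeNer` is a stand-in for `sner.Ner` so the snippet runs without a live server):

```python
from concurrent.futures import ThreadPoolExecutor

class FakeNer:
    """Stand-in for sner.Ner(host='localhost', port=9199).

    The toy rule below (title-case token -> PERSON) only mimics the
    response shape; the real server runs the CRF classifier.
    """
    def get_entities(self, text):
        return [(tok, "PERSON" if tok.istitle() else "O")
                for tok in text.split()]

tagger = FakeNer()
documents = ["John Doe", "jane at Acme"]

# Threads work well here: the per-call cost is network I/O, not a JVM fork.
with ThreadPoolExecutor(max_workers=8) as pool:
    tagged = list(pool.map(tagger.get_entities, documents))
```

With the real `sner.Ner`, each worker shares one running server, so the per-document cost stays in the tens of milliseconds seen above rather than seconds of JVM startup.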
