Stanford NER and POS tagging: multithreading for a large dataset
I am trying to use the Stanford NER and Stanford POS Tagger to parse about 23,000 documents. I have implemented it using the following pseudocode:
```
for each in documents:
    eachSentences = PunktTokenize(each)
    # code to generate NER tags
    # code to generate POS tags on the above output
```
For a 4-core machine with 15 GB RAM, the run time just for NER is approximately 945 hours. I have tried to speed things up using the "threading" library, but I get the following error:
```
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "removeStopWords.py", line 75, in partofspeechRecognition
    listOfRes_new = namedEntityRecognition(listRes[min:max])
  File "removeStopWords.py", line 63, in namedEntityRecognition
    listRes_ner.append(namedEntityRecognitionResume(eachResSentence))
  File "removeStopWords.py", line 50, in namedEntityRecognitionResume
    ner2Tags = ner2.tag(each.title().split())
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 71, in tag
    return sum(self.tag_sents([tokens]), [])
  File "/home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py", line 98, in tag_sents
    os.unlink(self._input_file_path)
OSError: [Errno 2] No such file or directory: '/tmp/tmpvMNqwB'
```
I am using NLTK version 3.2.1 and the Stanford NER/POS 3.7.0 jar files, along with the threading module. As far as I can see, this might be due to a thread lock on /tmp. Please correct me if I am wrong. Also, what is the best way to run the above using threads, or is there a better way to implement it?
I am using the 3 Class Classifier for NER and the Maxent POS Tagger.
PS - Please ignore the name of the Python file; I still haven't removed the stopwords or punctuation from the original text.
Edit - Using cProfile and sorting by cumulative time, I got the following top 20 calls:
```
600792 function calls (595912 primitive calls) in 60.795 seconds

Ordered by: cumulative time
List reduced from 3357 to 20 due to restriction <20>

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.000    0.000   60.811   60.811  removeStopWords.py:1(<module>)
     1    0.000    0.000   58.923   58.923  removeStopWords.py:76(partofspeechRecognition)
    28    0.001    0.000   58.883    2.103  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:69(tag)
    28    0.004    0.000   58.883    2.103  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/tag/stanford.py:73(tag_sents)
    28    0.001    0.000   56.927    2.033  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:63(java)
   141    0.001    0.000   56.532    0.401  /usr/lib/python2.7/subprocess.py:769(communicate)
   140    0.002    0.000   56.530    0.404  /usr/lib/python2.7/subprocess.py:1408(_communicate)
   140    0.008    0.000   56.492    0.404  /usr/lib/python2.7/subprocess.py:1441(_communicate_with_poll)
   400   56.474    0.141   56.474    0.141  {built-in method poll}
     1    0.001    0.001   43.522   43.522  removeStopWords.py:69(partofspeechRecognitionRes)
     1    0.000    0.000   15.401   15.401  removeStopWords.py:62(namedEntityRecognition)
     1    0.001    0.001   15.367   15.367  removeStopWords.py:46(namedEntityRecognitionRes)
   141    0.004    0.000    2.302    0.016  /usr/lib/python2.7/subprocess.py:651(__init__)
   141    0.020    0.000    2.287    0.016  /usr/lib/python2.7/subprocess.py:1199(_execute_child)
    56    0.002    0.000    1.933    0.035  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:38(config_java)
    56    0.001    0.000    1.931    0.034  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:599(find_binary)
   112    0.002    0.000    1.930    0.017  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:582(find_binary_iter)
   118    0.009    0.000    1.928    0.016  /home/datascience/pythonEnv/local/lib/python2.7/site-packages/nltk/internals.py:453(find_file_iter)
     1    0.001    0.001    1.318    1.318  /usr/lib/python2.7/pickle.py:1383(load)
     1    0.046    0.046    1.317    1.317  /usr/lib/python2.7/pickle.py:851(load)
```
It seems like the Python wrapper is the culprit here; the Java implementation itself is not taking much time. It takes approximately what @Gabor Angeli mentioned. Try it.
Hope it helps!
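The profile supports this: the 140 `subprocess communicate` calls account for roughly 56 of the 60 seconds, meaning a fresh JVM is launched for every `tag` call. Process-startup overhead is easy to observe in isolation; the sketch below uses the Python interpreter as a stand-in for the JVM (a real JVM start is considerably heavier, but the pattern is the same):

```python
import subprocess
import sys
import time

# Launch a trivial child process a few times and measure the
# average cost of a cold process start.
runs = 5
start = time.time()
for _ in range(runs):
    subprocess.check_call([sys.executable, "-c", "pass"])
per_call = (time.time() - start) / runs
print("process startup overhead: %.3f s per call" % per_call)
```

Amortizing this startup cost over many documents is exactly what the server-based approach below achieves.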
This may already have been solved, but for anyone trying to speed up Stanford NLP in Python, here is a tried and tested answer: How to speedup Stanford NLP in Python?
Basically, it asks you to run the NER server in the background and call it through the sner library for all Stanford NLP related tasks.
Found the answer.

Start the Stanford NLP server in the background, in the folder where Stanford NLP is unzipped. A portion of the answer is given below:
```
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
```
Then create the tagger in Python using the sner library:
```
from sner import Ner

tagger = Ner(host='localhost', port=9199)
```
Then run the tagger:

```
%%time
classified_text = tagger.get_entities(text)
print(classified_text)
```
Output:

```
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'PERSON'), ('Doe', 'PERSON')]
CPU times: user 4 ms, sys: 0 ns, total: 4 ms
Wall time: 18.2 ms
```
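The (token, tag) tuples that `get_entities` returns are easy to post-process in plain Python. For example, pulling the PERSON tokens out of the sample output above:

```python
# Sample output of tagger.get_entities(), copied from above
classified_text = [('My', 'O'), ('name', 'O'), ('is', 'O'),
                   ('John', 'PERSON'), ('Doe', 'PERSON')]

# Keep only the tokens tagged as PERSON
persons = [token for token, tag in classified_text if tag == 'PERSON']
print(persons)  # ['John', 'Doe']
```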