Serializing a custom function in spaCy for multiprocessing

I have a custom entity scrubber function that tells the TextRank algorithm (which I use to extract key phrases) not to tag certain entities as key phrases. I register this function with spaCy using:

import spacy
from spacy.tokens import Span

@spacy.registry.misc("entity_scrubber")
def articles_scrubber():
    def scrubber_func(span: Span) -> str:
        for token in span:
            # Reject phrases containing numeric, date/time, money, or person entities
            if token.ent_type_ in ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT',
                                   'PERSON', 'QUANTITY', 'TIME']:
                return "INELIGIBLE_PHRASE"
        return span.text
    return scrubber_func

I add the textrank component with the following line, passing the custom function through the config parameter:

nlp.add_pipe("textrank", config={"scrubber": {"@misc": "entity_scrubber"}})

I then try to process the docs with the nlp.pipe method, using spaCy's multiprocessing support by passing the n_process parameter:

nlp.pipe(docs, n_process=8, disable=['tok2vec','tagger','lemmatizer', 'attribute_ruler'])

But I get the following error:

AttributeError: Can't pickle local object 'articles_scrubber.<locals>.scrubber_func'

After searching online, I found that a function defined this way (as a nested function) is not picklable. I found some articles ( http://gael-varoquaux.info/programming/decoration-in-python-done-right-decorating-and-pickling.html , https://towardsdatascience.com/why-you-should-wrap-decorators-in-python-5ac3676835f9 ) that show how to write decorators so that the decorated functions remain picklable, but I couldn't find any approach that applies to spaCy.

Can someone point me in the right direction on how to approach this?

The problem has nothing to do with the function being decorated; rather, the actual worker function that will be invoked in the new process is not at global scope. pickle stores a function as a reference to its module-level name, so a function defined inside another function cannot be pickled. In the following demo I have a decorator, time_it, that prints the running time of the function it decorates. I use it to decorate both get_worker_function, which returns the worker function that will run in a child process, and the worker function, foo, itself. There is no problem when the worker function is at global scope:

from timing import time_it
from multiprocessing import Process

@time_it
def foo():
    print("It works!")

@time_it
def get_worker_function():
    # Return actual worker function
    return foo

if __name__ == '__main__':
    worker_function = get_worker_function()
    p = Process(target=worker_function)
    p.start()
    p.join()

Prints:

func: get_worker_function args: [(), {}] took: 6e-07 sec.
It works!
func: foo args: [(), {}] took: 0.0001581 sec.
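
The timing module imported above is not shown in the answer. Purely as an assumption about what it might contain, here is a hypothetical time_it that prints output in a similar shape. The one detail that matters for pickling is functools.wraps: it copies the wrapped function's name and qualified name onto the wrapper, so a decorated module-level function can still be looked up by name.

```python
import functools
import time


def time_it(func):
    """Hypothetical stand-in for timing.time_it: print how long func took."""

    @functools.wraps(func)  # keep func's __name__/__qualname__ on the wrapper
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"func: {func.__name__} args: [{args}, {kwargs}] took: {elapsed:.4g} sec.")
        return result

    return wrapper
```

Without functools.wraps the wrapper would report itself as time_it.<locals>.wrapper, which is a local object and would hit the same pickling error even for a worker defined at global scope.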

But if the worker function is not at global scope, you get an error:

from timing import time_it
from multiprocessing import Process


@time_it
def get_worker_function():
    # Return actual worker function
    def foo():
        print("It works!")
    return foo

if __name__ == '__main__':
    worker_function = get_worker_function()
    p = Process(target=worker_function)
    p.start()
    p.join()

Prints:

func: get_worker_function args: [(), {}] took: 7e-07 sec.
Traceback (most recent call last):
  File "C:\Booboo\test\test.py", line 15, in <module>
    p.start()
  File "C:\Program Files\Python38\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Program Files\Python38\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Program Files\Python38\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Program Files\Python38\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'get_worker_function.<locals>.foo'
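
The same failure can be reproduced without starting a process, because it comes from pickle itself. The following is a minimal sketch using only the standard library; the names is_picklable, top_level_worker, and make_nested_worker are made up for illustration and do not appear in the code above.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, TypeError, pickle.PicklingError):
        # The exact exception type for local objects varies by Python version.
        return False


def top_level_worker():
    # Defined at module (global) scope: pickle can refer to it by name.
    print("It works!")


def make_nested_worker():
    def nested_worker():
        # Defined inside another function: its qualified name contains
        # "<locals>", which pickle cannot resolve on the receiving side.
        print("It works!")

    return nested_worker


if __name__ == "__main__":
    print(is_picklable(top_level_worker))
    print(is_picklable(make_nested_worker()))
```

Run as a script, this should report that the module-level function can be pickled and the nested one cannot.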

Solution?

Try changing your code to this:

import spacy
from spacy.tokens import Span

# Defined at module (global) scope so that it can be pickled by name
def scrubber_func(span: Span) -> str:
    for token in span:
        # Reject phrases containing numeric, date/time, money, or person entities
        if token.ent_type_ in ['CARDINAL', 'DATE', 'MONEY', 'ORDINAL', 'PERCENT',
                               'PERSON', 'QUANTITY', 'TIME']:
            return "INELIGIBLE_PHRASE"
    return span.text

@spacy.registry.misc("entity_scrubber")
def articles_scrubber():
    return scrubber_func
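
If the scrubber ever needs configuration (the usual reason for wrapping it in a factory), one option that stays picklable is a module-level class with a __call__ method in place of a closure. This is only a sketch: the class name EntityScrubber and its skip_labels parameter are hypothetical, and it assumes the pipeline component only ever calls the scrubber with a span.

```python
import pickle


class EntityScrubber:
    """Hypothetical scrubber whose rejected entity labels are configurable.

    Instances can be pickled because the class lives at module scope and
    its only state is a plain frozenset of strings.
    """

    def __init__(self, skip_labels):
        self.skip_labels = frozenset(skip_labels)

    def __call__(self, span) -> str:
        # `span` is expected to behave like a spaCy Span: iterable over
        # tokens that have `ent_type_`, and exposing `.text`.
        for token in span:
            if token.ent_type_ in self.skip_labels:
                return "INELIGIBLE_PHRASE"
        return span.text


if __name__ == "__main__":
    scrubber = EntityScrubber(["DATE", "MONEY", "PERSON"])
    # Round-trip through pickle, as multiprocessing would do.
    restored = pickle.loads(pickle.dumps(scrubber))
    print(restored.skip_labels == scrubber.skip_labels)
```

The registered articles_scrubber function could then return an instance such as EntityScrubber([...]) instead of a function.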
