More efficient implementation of Textacy / spacy 'subject_verb_object_triples'

I'm trying to apply the 'extract.subject_verb_object_triples' function from textacy to my dataset. However, the code I have written is very slow and memory intensive. Is there a more efficient implementation?

import spacy
import textacy

def extract_SVO(text):

    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    tuples = textacy.extract.subject_verb_object_triples(doc)
    tuples_to_list = list(tuples)
    if tuples_to_list != []:
        tuples_list.append(tuples_to_list)

tuples_list = []          
sp500news['title'].apply(extract_SVO)
print(tuples_list)

Sample data (sp500news)

    date_publish  \
0       2013-05-14 17:17:05   
1       2014-05-09 20:15:57   
4       2018-07-19 10:29:54   
6       2012-04-17 21:02:54   
8       2012-12-12 20:17:56   
9       2018-11-08 10:51:49   
11      2013-08-25 07:13:31   
12      2015-01-09 00:54:17   

 title  
0       Italy will not dismantle Montis labour reform  minister                            
1       Exclusive US agency FinCEN rejected veterans in bid to hire lawyers                
4       Xis campaign to draw people back to graying rural China faces uphill battle        
6       Romney begins to win over conservatives                                            
8       Oregon mall shooting survivor in serious condition                                 
9       Polands PGNiG to sign another deal for LNG supplies from US CEO                    
11      Australias opposition leader pledges stronger economy if elected PM                
12      New York shifts into Code Blue to get homeless off frigid streets                  

This should speed it up somewhat:

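import spacy
import textacy

# Load the model once, at module level, instead of on every function call.
nlp = spacy.load('en_core_web_sm')

def extract_SVO(doc):
    # subject_verb_object_triples returns a generator, and a generator object
    # is always truthy, so convert it to a list before testing for emptiness.
    tuples_to_list = list(textacy.extract.subject_verb_object_triples(doc))
    if tuples_to_list:
        tuples_list.append(tuples_to_list)

tuples_list = []
# Parse every title once; keep the Doc objects separate from the raw text.
docs = sp500news['title'].apply(nlp)
_ = docs.apply(extract_SVO)
print(tuples_list)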

Explanation

In the OP's implementation, nlp = spacy.load('en_core_web_sm') is called inside the function, so the model is reloaded on every call. I suspect this is the biggest bottleneck. Moving the call out of the function should speed things up.
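Going one step further than the answer itself (so treat this as a sketch, not as the answerer's method), the titles can be handed to spaCy in batches with nlp.pipe rather than parsed one row at a time. The helper name extract_all_svo and its parameters are hypothetical. The loaded pipeline is passed in as nlp, so the caller still loads the model only once, and extract_triples stands for textacy.extract.subject_verb_object_triples.

```python
def extract_all_svo(texts, nlp, extract_triples, batch_size=256):
    """Collect the non-empty SVO triple lists for an iterable of texts.

    Hypothetical helper: `nlp` is an already-loaded spaCy pipeline and
    `extract_triples` is a callable such as
    textacy.extract.subject_verb_object_triples.
    """
    results = []
    # nlp.pipe processes the texts in batches, which is usually faster
    # than calling nlp(text) separately for each one.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        triples = list(extract_triples(doc))  # materialise the generator
        if triples:  # keep only documents that produced at least one triple
            results.append(triples)
    return results


# Intended usage (assumes spaCy, textacy and the en_core_web_sm model are
# installed, and that sp500news is the DataFrame from the question):
#
#   nlp = spacy.load('en_core_web_sm')   # loaded once
#   tuples_list = extract_all_svo(
#       sp500news['title'], nlp,
#       textacy.extract.subject_verb_object_triples)
```

The batch_size of 256 is an arbitrary starting value; the best setting depends on text length and available memory.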

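Also, the result only needs to be appended when it is non-empty. Because subject_verb_object_triples returns a generator, and a generator object is always truthy, that check has to be made on the list after conversion, not on the generator itself.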
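The emptiness check deserves care because a generator object is truthy even when it yields nothing. Here is a small standalone illustration in plain Python. The function fake_triples is a made-up stand-in for subject_verb_object_triples and does no real parsing; it only mimics the generator behaviour.

```python
def fake_triples(words):
    """Made-up stand-in: yields one triple per three consecutive words."""
    for i in range(0, len(words) - 2, 3):
        yield tuple(words[i:i + 3])


empty = fake_triples([])      # a generator that yields nothing
print(bool(empty))            # True: generator objects are always truthy
print(bool(list(empty)))      # False: the materialised list is empty


def collect(words, sink):
    """Append the triples for `words` to `sink` only if there are any."""
    triples = list(fake_triples(words))
    if triples:
        sink.append(triples)


sink = []
collect([], sink)                                   # nothing is appended
collect(["Romney", "wins", "conservatives"], sink)
print(sink)  # [[('Romney', 'wins', 'conservatives')]]
```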
