I have a df with a column of comments from which I want to extract the organisations. This article describes a great approach, but it is too slow for my problem: my df has over 1,000,000 rows and I am running in a Google Colab notebook.
Currently my approach is (from the linked article):
def get_orgs(text):
    # process the text with our spaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if an organization is identified more than once it will appear multiple times in the list;
    # we use set() to remove duplicates, then convert back to a list
    org_list = list(set(org_list))
    return org_list

df['organizations'] = df['body'].apply(get_orgs)
Is there a faster way to process this? And would you advise applying it to a Pandas df, or are there better/faster alternatives?
There are a couple of things you can do in general to speed up spaCy. There's a section in the docs on this.
The first thing to try is creating docs in a pipe. You'll need to be a little creative to get this working with a dataframe:
org_lists = []
for doc in nlp.pipe(iter(df['body'])):
    org_lists.append(...)  # do your processing here
# now you can add a column to your dataframe
The other thing is to disable components you aren't using. Since it looks like you're only using NER you can do this:
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
Those together should give you a significant speedup.
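Putting the two ideas together, here is a minimal sketch of what the combined version could look like. It assumes `nlp` is an already-loaded spaCy pipeline and `df` has a `body` column, as in the question; the `batch_size` value and the helper names (`unique_orgs`, `add_org_column`) are illustrative, not part of spaCy's API:

```python
def unique_orgs(ents):
    """Deduplicate ORG entity texts; ents is an iterable of (text, label) pairs."""
    return list({text for text, label in ents if label == "ORG"})

def add_org_column(df, nlp, batch_size=1000):
    """Run NER over df['body'] in batches and attach the results as a column."""
    org_lists = []
    # nlp.pipe batches the texts, and disabling the unused components
    # means only the NER step runs for each doc
    for doc in nlp.pipe(
        df["body"],
        batch_size=batch_size,
        disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
    ):
        org_lists.append(unique_orgs((ent.text, ent.label_) for ent in doc.ents))
    df["organizations"] = org_lists
    return df
```

Note that, unlike the `apply` version, the rows are processed in one streaming pass rather than one `nlp(...)` call per row, which is where most of the speedup comes from.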