
How to extract person names from a data frame in Python using spaCy

I have a table that contains people's names in text. I would like to de-identify that text by removing the people's names from every instance while keeping the rest of the sentence intact.

Row Num            Current Sent                            Ideal Sent
1                 Garry bought a cracker.                 bought a cracker.
2                 He named the parrot Eric.               He named the parrot.
3                 The ship was maned by Captain Jones.    The ship was maned by Captain.

How can I do that with spaCy? I know I have to find the entities labeled 'PERSON' and then apply the function to each row, but I can't seem to get the intended result. This is what I have so far:

def pro_nn_finder(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

df.apply(pro_nn_finder)

One approach:

import pandas as pd
import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

df = pd.DataFrame(data=["Garry bought a cracker.",
                        "He named the parrot Eric.",
                        "The ship was maned by Captain Jones."], columns=["Current Sent"])


def remove_person(txt):
    doc = nlp(txt)
    chunks = [range(entity.start_char, entity.end_char) for entity in doc.ents if entity.label_ == 'PERSON']
    to_remove = set().union(*chunks)
    return "".join(c for i, c in enumerate(txt) if i not in to_remove)


df["Ideal Sent"] = df["Current Sent"].apply(remove_person)
print(df)

Output

                           Current Sent                       Ideal Sent
0               Garry bought a cracker.                bought a cracker.
1             He named the parrot Eric.            He named the parrot .
2  The ship was maned by Captain Jones.  The ship was maned by Captain .
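Note the stray space the removal leaves behind ("the parrot ." instead of "the parrot."). A hedged variant of the span-removal idea that also drops one space adjacent to each entity; the spans are hardcoded here for illustration, whereas in practice they would come from ent.start_char/ent.end_char as above:

```python
def remove_spans(txt, spans):
    """Remove each (start, end) character span plus one adjacent space."""
    to_remove = set()
    for start, end in spans:
        to_remove.update(range(start, end))
        if start > 0 and txt[start - 1] == " ":
            to_remove.add(start - 1)   # space before the entity
        elif end < len(txt) and txt[end] == " ":
            to_remove.add(end)         # space after a sentence-initial entity
    return "".join(c for i, c in enumerate(txt) if i not in to_remove)

print(remove_spans("He named the parrot Eric.", [(20, 24)]))  # He named the parrot.
print(remove_spans("Garry bought a cracker.", [(0, 5)]))      # bought a cracker.
```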

Here's how I would do it. The thing to note here is that nlp can take a long time, so I'd do it once, store the resulting doc objects in a new column, and then proceed with the filtering. Since you are interested in the whole document and not just the entities, it's better to use the Token.ent_type_ attribute than going the doc.ents route.

import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")
df = pd.DataFrame({"sent": ["Garry bought a cracker.",
                            "He named the parrot Eric.",
                            "The ship was maned by Captain Jones."]})

df["proc_sent"] = df.sent.apply(nlp) # expensive step

df["ideal_sent"] = df.proc_sent.apply(lambda doc: ' '.join(tok.text for tok in doc if tok.ent_type_ != "PERSON"))
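One caveat: joining tok.text with " " reattaches every token with a single space, which detaches punctuation (e.g. "cracker ."). spaCy tokens also expose a text_with_ws attribute that carries each token's trailing whitespace, so concatenating that preserves the original spacing. A minimal sketch, simulating tokens as (text, trailing_whitespace) pairs so it runs without a model download:

```python
# Stand-ins for spaCy tokens: (tok.text, tok.whitespace_) pairs.
tokens = [("Garry", " "), ("bought", " "), ("a", " "),
          ("cracker", ""), (".", "")]
person = {"Garry"}  # stand-in for tok.ent_type_ == "PERSON"

# Plain space-join detaches the final period.
plain = " ".join(t for t, _ in tokens if t not in person)
# Concatenating text + trailing whitespace keeps the original spacing.
with_ws = "".join(t + ws for t, ws in tokens if t not in person)

print(plain)    # bought a cracker .
print(with_ws)  # bought a cracker.
```

With a real Doc this would be ''.join(tok.text_with_ws for tok in doc if tok.ent_type_ != "PERSON"), though a space contributed by a token just before a removed entity can still linger.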

Alternatively, you can explode the doc column so you end up with one token per cell. That allows for more panda-esque data processing.

df2 = df.explode("proc_sent")

Now df2.proc_sent looks like this:

0      Garry
0     bought
0          a
0    cracker
0          .
1         He
1      named
1        the
1     parrot
1       Eric
1          .

So you can filter out PERSON entities via

>>> df2[df2.proc_sent.apply(lambda tok: tok.ent_type_) != "PERSON"]
                                   sent proc_sent
0               Garry bought a cracker.    bought
0               Garry bought a cracker.         a
0               Garry bought a cracker.   cracker
0               Garry bought a cracker.         .
1             He named the parrot Eric.        He
1             He named the parrot Eric.     named
1             He named the parrot Eric.       the
1             He named the parrot Eric.    parrot
1             He named the parrot Eric.         .
...

Of course, that only makes sense if you need to do more complex processing; to get the sentence strings back you need a groupby etc., which makes it more complicated overall for this application.
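For completeness, that groupby step might look like the sketch below. The token strings are hardcoded here to stand in for the filtered df2; a real df2.proc_sent holds Token objects, so you would map them through tok.text first:

```python
import pandas as pd

# Hypothetical stand-in for the exploded, PERSON-filtered frame:
# one token per row, original row number as the index.
df2 = pd.DataFrame(
    {"proc_sent": ["bought", "a", "cracker", ".",
                   "He", "named", "the", "parrot", "."]},
    index=[0, 0, 0, 0, 1, 1, 1, 1, 1],
)

# Rebuild one sentence string per original row.
ideal = df2.groupby(level=0)["proc_sent"].agg(" ".join)
print(ideal[0])  # bought a cracker .
```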
