I have a table which contains people's names in text. I would like to de-identify that text by removing the people's names from every row while keeping the rest of the sentence intact.
Row Num | Current Sent                         | Ideal Sent
1       | Garry bought a cracker.              | bought a cracker.
2       | He named the parrot Eric.            | He named the parrot.
3       | The ship was maned by Captain Jones. | The ship was maned by Captain.
How can I do that with spaCy? I know you have to look for entities with the label 'PERSON' and then apply that to each row, but I can't seem to get the intended result. This is what I have so far:
def pro_nn_finder(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

df.apply(pro_nn_finder)
One approach:
import pandas as pd
import spacy
# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
df = pd.DataFrame(data=["Garry bought a cracker.",
                        "He named the parrot Eric.",
                        "The ship was maned by Captain Jones."],
                  columns=["Current Sent"])

def remove_person(txt):
    doc = nlp(txt)
    chunks = [range(entity.start_char, entity.end_char)
              for entity in doc.ents if entity.label_ == 'PERSON']
    to_remove = set().union(*chunks)
    return "".join(c for i, c in enumerate(txt) if i not in to_remove)
df["Ideal Sent"] = df["Current Sent"].apply(remove_person)
print(df)
Output
Current Sent Ideal Sent
0 Garry bought a cracker. bought a cracker.
1 He named the parrot Eric. He named the parrot .
2 The ship was maned by Captain Jones. The ship was maned by Captain .
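A small variation on the same character-span idea: instead of deleting the name outright, you can swap each PERSON span for a placeholder, which is often what de-identification pipelines want. This is a sketch, not part of the answer above; it builds the entity by hand with spacy.blank and a manual doc.ents assignment purely so the example runs deterministically without downloading a model — with a trained pipeline you would use nlp = spacy.load("en_core_web_sm") and let NER supply doc.ents.

```python
import spacy
from spacy.tokens import Span

def redact_persons(doc, placeholder="[PERSON]"):
    """Replace each PERSON entity span with a placeholder.

    Iterate right to left so earlier character offsets stay valid
    after each replacement.
    """
    txt = doc.text
    for ent in reversed([e for e in doc.ents if e.label_ == "PERSON"]):
        txt = txt[:ent.start_char] + placeholder + txt[ent.end_char:]
    return txt

# Blank pipeline + hand-labelled span: a stand-in for a trained NER model.
nlp = spacy.blank("en")
doc = nlp("He named the parrot Eric.")
doc.ents = [Span(doc, 4, 5, label="PERSON")]  # "Eric"

print(redact_persons(doc))  # He named the parrot [PERSON].
```

The right-to-left replacement matters: substituting left to right would shift the start_char/end_char offsets of every entity that follows.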
Here's how I would do it. The thing to note here is that nlp can take a long time, so I'd run it once, store the resulting Doc objects in a new column, and then proceed with the filtering. Since you are interested in the whole document and not just the entities, it's better to use the Token.ent_type_ attribute than to go the doc.ents route.
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")

df = pd.DataFrame({"sent": ["Garry bought a cracker.",
                            "He named the parrot Eric.",
                            "The ship was maned by Captain Jones."]})

df["proc_sent"] = df.sent.apply(nlp)  # expensive step, do it only once
df["ideal_sent"] = df.proc_sent.apply(
    lambda doc: ' '.join(tok.text for tok in doc if tok.ent_type_ != "PERSON"))
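One caveat with the ' '.join above: it inserts a space before every token, so punctuation ends up detached ("bought a cracker ."). spaCy's Token.text_with_ws keeps each token's original trailing whitespace, which avoids that. A minimal sketch; as in the placeholder example, the blank pipeline and hand-labelled span are only there so it runs without a model download:

```python
import spacy
from spacy.tokens import Span

# Stand-in for a trained pipeline: label "Garry" as PERSON by hand.
nlp = spacy.blank("en")
doc = nlp("Garry bought a cracker.")
doc.ents = [Span(doc, 0, 1, label="PERSON")]

# text_with_ws preserves each token's own trailing whitespace, so the
# period stays attached to "cracker".
clean = "".join(tok.text_with_ws for tok in doc if tok.ent_type_ != "PERSON")
print(clean)  # bought a cracker.
```

Note it isn't perfect either: when the removed name sits directly before punctuation, the previous token's trailing space still survives, but it never detaches punctuation the way a plain space-join does.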
Alternatively, you can explode the proc_sent column so you end up with one token per cell. That allows for more pandas-esque data processing.

df2 = df.explode("proc_sent")

Now df2.proc_sent looks like this:
0 Garry
0 bought
0 a
0 cracker
0 .
1 He
1 named
1 the
1 parrot
1 Eric
1 .
So you can filter out PERSON entities via
>>> df2[df2.proc_sent.apply(lambda tok: tok.ent_type_) != "PERSON"]
sent proc_sent
0 Garry bought a cracker. bought
0 Garry bought a cracker. a
0 Garry bought a cracker. cracker
0 Garry bought a cracker. .
1 He named the parrot Eric. He
1 He named the parrot Eric. named
1 He named the parrot Eric. the
1 He named the parrot Eric. parrot
1 He named the parrot Eric. .
...
Of course, that only makes sense if you need to do more complex processing, because to get the sentence strings back you need a groupby etc., which makes this route more complicated overall for this particular application.
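For completeness, the groupby step alluded to above can be sketched like this: group the exploded, filtered frame by the original row index and join the surviving tokens back into one string per sentence. As in the earlier sketches, the hand-labelled spans on a blank pipeline are only a stand-in for a trained NER model:

```python
import pandas as pd
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")

def make_doc(text, person_idx):
    """Tokenize and hand-label the tokens at person_idx as PERSON."""
    doc = nlp(text)
    doc.ents = [Span(doc, i, i + 1, label="PERSON") for i in person_idx]
    return doc

df = pd.DataFrame({"sent": ["Garry bought a cracker.",
                            "He named the parrot Eric."]})
df["proc_sent"] = [make_doc(df.sent[0], [0]),   # "Garry"
                   make_doc(df.sent[1], [4])]   # "Eric"

# One token per cell, original row index preserved on each token row.
df2 = df.explode("proc_sent")
kept = df2[df2.proc_sent.apply(lambda tok: tok.ent_type_) != "PERSON"]

# Group by the original row index and stitch the tokens back together.
ideal = kept.groupby(level=0).proc_sent.apply(
    lambda toks: " ".join(t.text for t in toks))
print(ideal[0])  # bought a cracker .
print(ideal[1])  # He named the parrot .
```

As the joined output shows, this reproduces the space-before-punctuation artifact of the ' '.join approach, which is part of why the exploded route only pays off for more complex token-level processing.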