I have a data frame named "df1" with 12 columns. The last column is called notes. I need to find common names like "john, sally and richard" in this column and replace them with xxxx or something similar. I have a working script that creates this data frame from MS SQL. I have spent several hours and used various resources trying to get some code that does this, but I have not been successful. I do not have to use spaCy, but I was told it is a good package to work with. Any help would be appreciated.
You can use a solution like this:
import spacy
import pandas as pd

# Test dataframe
df = pd.DataFrame({'notes': ["Peter A. Smith came to see Bart in Washington on Tuesday."]})
print(df['notes'])
# => 0 Peter A. Smith came to see Bart in Washington on Tuesday.
# Entities in the text: <<PERSON>> came to see <<PERSON>> in <<GPE>> on <<DATE>>.

nlp = spacy.load('en_core_web_trf')

def redact_with_spacy(text: str) -> str:
    doc = nlp(text)
    newString = text
    # Go through the entities from last to first so that earlier offsets stay valid
    for e in reversed(doc.ents):
        if e.label_ == "PERSON":  # Only redact PERSON entities
            start = e.start_char
            end = start + len(e.text)
            newString = newString[:start] + "xxxx" + newString[end:]
    return newString

df['notes'] = df['notes'].apply(redact_with_spacy)
print(df['notes'])
Output:
0 xxxx came to see xxxx in Washington on Tuesday.
Note that you can adjust the "xxxx" replacement in the redact_with_spacy function. E.g., to replace each found entity with the same number of x characters, use

    newString = newString[:start] + ("x" * len(e.text)) + newString[end:]

Or, to keep the spaces:

    newString = newString[:start] + "".join(["x" if not x.isspace() else " " for x in e.text]) + newString[end:]
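The replacement variants above can be tried without loading a spaCy model. The following is only a minimal sketch, assuming the entity spans are already known as (start, end) character offsets, like the values of e.start_char and e.end_char. The helper names redact_spans, fixed, same_length and keep_spaces are made up for this illustration and are not part of spaCy, and the spans are typed in by hand for the test sentence.

```python
def fixed(entity_text: str) -> str:
    # Always the same placeholder, whatever the entity length
    return "xxxx"

def same_length(entity_text: str) -> str:
    # One "x" per character of the entity
    return "x" * len(entity_text)

def keep_spaces(entity_text: str) -> str:
    # One "x" per non-space character; whitespace is kept as a single space
    return "".join(" " if ch.isspace() else "x" for ch in entity_text)

def redact_spans(text, spans, mask=fixed):
    # Work from right to left so that a replacement of a different length
    # does not shift the offsets of the spans that come before it.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask(text[start:end]) + text[end:]
    return text

text = "Peter A. Smith came to see Bart in Washington on Tuesday."
spans = [(0, 14), (27, 31)]  # hand-written offsets of "Peter A. Smith" and "Bart"

print(redact_spans(text, spans))
# xxxx came to see xxxx in Washington on Tuesday.
print(redact_spans(text, spans, same_length))
# xxxxxxxxxxxxxx came to see xxxx in Washington on Tuesday.
print(redact_spans(text, spans, keep_spaces))
# xxxxx xx xxxxx came to see xxxx in Washington on Tuesday.
```

Passing the mask as a function keeps the offset handling in one place, so switching the redaction style does not touch the loop.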
import pandas as pd
import spacy

nlp = spacy.load("en_core_web_md")  # model

Here I'm redacting entities by their label; you can select which labels to remove. The labels available in the model are:

nlp.get_pipe('ner').labels
# => ('CARDINAL', 'DATE', 'EVENT', 'FAC', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'MONEY', 'NORP', 'ORDINAL', 'ORG', 'PERCENT', 'PERSON', 'PRODUCT', 'QUANTITY', 'TIME', 'WORK_OF_ART')

def replace_text(mytext):
    labels = ['PERSON']  # entity labels to redact; here only PERSON
    doc = nlp(mytext)
    labels_to_remove = []  # texts of the entities that will be replaced
    final_string = str(mytext)
    for ent in doc.ents:
        if ent.label_ in labels:
            labels_to_remove.append(ent.text)
    for entity_text in labels_to_remove:
        final_string = final_string.replace(entity_text, '[REDACTED]')
    return final_string

# 'Sentences' is the text column in my data; for the data frame in the question it would be df1['notes']
df['Sentences_Redacted_Function'] = df['Sentences'].apply(replace_text)
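One thing to be aware of with str.replace is that it substitutes every occurrence of the entity text, including occurrences inside longer words. If that matters for your notes, one possible variation is a regular expression that only matches whole words. This is a sketch in plain Python, not something spaCy provides: replace_names is a hypothetical helper, and the names list stands in for the entity texts collected from doc.ents.

```python
import re

def replace_names(text, names, placeholder="[REDACTED]"):
    if not names:
        return text
    # Longest names first, so "Bart Simpson" is tried before "Bart"
    alternatives = sorted(set(names), key=len, reverse=True)
    # Lookarounds instead of \b, so names ending in punctuation ("A.") still match
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(n) for n in alternatives) + r")(?!\w)"
    )
    return pattern.sub(placeholder, text)

sample = "Bart met Bartholomew and Bart Simpson."

print(sample.replace("Bart", "[REDACTED]"))
# [REDACTED] met [REDACTED]holomew and [REDACTED] Simpson.
print(replace_names(sample, ["Bart", "Bart Simpson"]))
# [REDACTED] met Bartholomew and [REDACTED].
```

Whether "Bartholomew" should be redacted as well depends on what the model tagged; the sketch only avoids partial matches.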