
Extracting Emails using NLP - Spacy Matcher and then encrypting and decrypting them

I have a CSV file (sample contents are shown below). I read the file, define a pattern, and use the spaCy Matcher while iterating through the rows and columns. My end goal is to identify email IDs and SSNs as sensitive information and encrypt and decrypt them. Unfortunately, every field is being encrypted and decrypted in the process.

import spacy
from spacy.matcher import Matcher
import csv
from cryptography.fernet import Fernet
from spacy.vocab import Vocab
from spacy.tokens import Span
from spacy import displacy
nlp = spacy.load('en_core_web_sm')
# pattern = [{"TEXT": {"REGEX": "[a-zA-z0-9-_.] +@[a-zA-z0-9-_.]+"}, "OP":"?"}]
pattern =[{"TEXT": {"REGEX": "[a-z0-9\.\-+_] +@[a-z0-9\.\-+_]+\.[a-z]+"}, "OP":"?"}]
matcher = Matcher(nlp.vocab)
matcher.add("Email", None, pattern)
some_list = []
count = 0

file = open("PIIsampleData.csv")
csv_file = csv.reader(file)
for row in csv_file:
    # print(row)
    for text in row:
        doc = nlp(text)
        # print(doc.ents)
        matches = matcher(doc)
        print(matches)

        ent = [(ent.text,ent.label_) for ent in doc.ents]
        for match_id, start, end in matches:
            span = doc[start:end]
            print(span)
            # print(span.text)
        #     entity = [(ent.text, ent.label_) for ent in doc.ents]
        if ent:
            some_dict = {}
            b = bytes(doc.text, 'utf-8')
            name = Fernet.generate_key()
            lock = Fernet(name)
            code = lock.encrypt(b)
            original = lock.decrypt(code)
            # print(code, original, doc.text)
            some_dict = {'code': code, 'original': original, 'label': ent}
            some_list.append(some_dict)
            # print(some_list)
            count = count + 1

I think I am missing something here. I'm not sure whether I need EntityRuler or whether there is some other problem with my code.

for match_id, start, end in matches:
    span = doc[start:end]
    print(span)

This span comes back blank. Ideally it should have fetched the emails, right?
When I print some_list, the final output shows every column encrypted and decrypted, whereas I want only emails and SSNs to be identified as sensitive information and encrypted. I know I haven't defined a regex for SSN yet, so just help me with emails for now.

csv

Employee Name,Employee Email,Phone,Personal number,Organisation number,SSN
Price Cummings,lectus.Nullam@tristique.ca,1-509-928-5746,0,364858-2678,795-63-3325
McKenzie G. Rios,ullamcorper.Duis@sempertellus.ca,1-118-309-0368,16680213 -6611,208206-7964,183-91-0062
Scarlet Estrada,ornare@dolorFuscemi.net,163-5585,16330216 -1611,727359-3280,739-89-4031
Virginia Knowles,lorem@dui.co.uk,874-2186,16691013 -3450,497114-4243,382-62-1298
Reed S. Pennington,nunc@Phasellusataugue.edu,358-0513,16930326 -4221,724596-5152,190-00-3181
Mona Nelson,pellentesque.Sed.dictum@luctus.com,1-681-841-0005,16750725 -6951,028041-2412,943-18-8562

I must say I don't understand why you're using spacy to match things you specify as a regular expression which could be matched equally well and more simply by ordinary regular expressions.

I've got it at least indicating which email addresses need encrypting. I'm sure you can do the rest.

I had to change your regex so it actually matched an email address. In particular, you had " +@" in your regular expression, which requires at least one space before the @. Email addresses don't contain a space there, so nothing matched. I changed that to " *@" (zero or more spaces), though the space probably shouldn't be there at all. Also, the character class before the space had no + after it, so it matched only a single character. I simplified the part after the @; I'm sure you can sort that out.
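To make the difference concrete, here is a small check with plain re on one address from the sample data. The third pattern, named strict, is my own assumption and is not part of the code in this answer; it drops the space and ignores case, which is one way to tighten things up.

```python
import re

# Pattern from the question: ONE character from the class, then one or
# more spaces, then "@". Real addresses have no space there.
original = re.compile(r"[a-z0-9.\-+_] +@[a-z0-9.\-+_]+\.[a-z]+")

# Pattern from this answer: one or more characters, optional spaces, "@".
relaxed = re.compile(r"[a-z0-9.\-+_]+ *@[a-z0-9.\-+_]+")

# Assumed tightening (not in the answer's code): no space at all,
# case-insensitive, and at least one dot in the domain.
strict = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9\-]+(?:\.[a-z0-9\-]+)+", re.IGNORECASE)

sample = "lectus.Nullam@tristique.ca"

original_hit = original.search(sample)   # no space before "@", so no match
relaxed_hit = relaxed.search(sample)     # matches, but only from the lowercase tail
strict_hit = strict.fullmatch(sample)    # matches the whole address

print(original_hit)
print(relaxed_hit.group())
print(strict_hit.group())
```

Note that the relaxed pattern is lowercase-only, so on this address it picks up only the part after the capital N. That is enough to flag the field as containing an email, but not enough to extract the full address.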

Also I had to specifically skip the header row of the CSV.
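As an alternative to a first-row flag, the header can be consumed with next() before the loop. This is only a sketch: the io.StringIO object stands in for the real file so the snippet runs on its own.

```python
import csv
import io

# Two lines of the sample data stand in for the real CSV file.
sample = io.StringIO(
    "Employee Name,Employee Email,Phone,Personal number,Organisation number,SSN\n"
    "Price Cummings,lectus.Nullam@tristique.ca,1-509-928-5746,0,364858-2678,795-63-3325\n"
)

reader = csv.reader(sample)
header = next(reader, None)  # consume the header row; None if the input is empty
rows = list(reader)          # only data rows remain

print(header)
print(rows)
```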

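You weren't actually checking for a match (your code tested the named entities in doc.ents rather than the matcher result), so you were encrypting everything. This code treats a non-empty span as requiring encryption. You will still have to handle the case where a field yields multiple matching spans.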
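A rough sketch of what "encrypt only the matching fields" could look like. The function name protect_row and the stand-in fake_encrypt are made up for illustration, and fake_encrypt is not encryption at all; it is only there so the snippet runs without third-party packages. In real use you would pass Fernet(key).encrypt from the cryptography package instead.

```python
import re

# Assumed email pattern, the same relaxed one used in this answer.
EMAIL = re.compile(r"[a-z0-9.\-+_]+ *@[a-z0-9.\-+_]+")

def protect_row(row, encrypt):
    """Return a copy of row in which only fields containing an email are encrypted.

    encrypt is any callable that takes bytes and returns bytes,
    for example Fernet(key).encrypt.
    """
    protected = []
    for text in row:
        if EMAIL.search(text):
            protected.append(encrypt(text.encode("utf-8")))
        else:
            protected.append(text)  # not sensitive, left as plain text
    return protected

# Stand-in so the sketch is runnable; it just reverses the bytes.
# This is NOT encryption. Swap in Fernet(key).encrypt for real use.
def fake_encrypt(data: bytes) -> bytes:
    return data[::-1]

row = ["Price Cummings", "lectus.Nullam@tristique.ca", "1-509-928-5746"]
result = protect_row(row, fake_encrypt)
print(result)
```

If you do use Fernet, generate the key once outside the loop and keep it somewhere safe. The code in the question creates a new key for every field and never stores it, so nothing could be decrypted later.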

Python 3.8.3 code (on a Python version older than 3.8 you'll have to edit the print() statements that use the new f"{value=}" syntax)

import spacy
from spacy.matcher import Matcher
import csv
from cryptography.fernet import Fernet
# UNUSED from spacy.vocab import Vocab
# UNUSED from spacy.tokens import Span
#UNUSED from spacy import displacy

nlp = spacy.load('en_core_web_sm')

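# doesn't work: pattern = [{"TEXT": {"REGEX": r"[a-z0-9\.\-+_] +@[a-z0-9\.\-+_]+\.[a-z]+"}, "OP": "?"}]
# raw string so that \. and \- reach the regex engine without invalid-escape warnings
pattern = [{"TEXT": {"REGEX": r"[a-z0-9\.\-+_]+ *@[a-z0-9\.\-+_]+"}, "OP": "?"}]
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 this call is matcher.add("Email", [pattern])
matcher.add("Email", None, pattern)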
some_list = []
count = 0

file = open("ex1.csv")
csv_file = csv.reader(file)
first=True
for row in csv_file:
    # skip the header row
    if first:
        first=False
        continue
    print( f"{row=}" )
    for text in row:
        print( "===============================" )
        print( f"{text=}" )
        doc = nlp(text)
        print(f"{doc=}" )
        print(f"{doc.ents=}" )
        matches = matcher(doc)
        print(f"{matches=}")

        encrypted=False
        for match_id, start, end in matches:
            span = doc[start:end]
            print(f"Match {match_id=} span {start=} {end=} {span=}")
            if span:
                print( f"ENCRYPT '{span}'" )
                encrypted=True
        if encrypted:
            # do something with the encrypted value
            pass
        else:
            print( f"NOT ENCRYPTED {text}" )
            # do something with the non-encrypted value
            pass

But this would be much simpler (and therefore easier to implement) using ordinary regular expressions. For example, this matches email addresses or SSNs (the SSN must not be immediately preceded or followed by a digit):

import csv
import re

emailpattern = r"([a-z0-9\.\-+_]+ *@[a-z0-9\.\-+_]+)"
emailorssnpattern=r"([a-z0-9\.\-+_]+ *@[a-z0-9\.\-+_]+|(?<!\d)\d{3}-\d{2}-\d{4}(?!\d))"

some_list = []
count = 0

file = open("ex1.csv")
csv_file = csv.reader(file)
first=True
for row in csv_file:
    # skip the header row
    if first:
        first=False
        continue
    print( f"{row=}" )
    for text in row:
        print( "===============================" )
        print( f"{text=}" )
        match = re.findall(emailorssnpattern,text)
        print( "match=",match )
        if match:
            print( f"ENCRYPT {text}" )
        else:
            print( f"NO NEED TO ENCRYPT {text}" )
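To see what the lookbehind and lookahead around the SSN part buy you, here is a quick check against the other number-like columns from the sample data. The value in the embedded test is invented for illustration.

```python
import re

# SSN part of the pattern above: 3-2-4 digits, with no digit directly
# before (negative lookbehind) or directly after (negative lookahead).
SSN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")

# Phone, organisation number, personal number and SSN from the sample rows.
fields = ["1-509-928-5746", "364858-2678", "16680213 -6611", "795-63-3325"]
flags = {field: bool(SSN.search(field)) for field in fields}

# Made-up value: an SSN-shaped run buried inside a longer digit string
# is rejected because of the digits on either side.
embedded = bool(SSN.search("1795-63-33250"))

print(flags)
print(embedded)
```

Only the real SSN column is flagged; the phone, organisation and personal numbers do not have the 3-2-4 digit grouping, so they are left alone.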

Good luck!
