
How to extract specific information from emails using machine learning?

I have multiple emails containing a list of stocks, prices and quantities. Each day the list is formatted a little differently, and I was hoping to use NLP to read in the data and reformat it so the information is presented consistently.

Here is a sample of the emails I receive:

Symbol  Quantity    Rate
AAPL    16        104
MSFT    8.3k      56.24
GS      34        103.1
RM      3,400     -10
APRN    6k        11
NP      14,000    -44

As we can see, the quantity comes in varying formats; the ticker is always a standard symbol, but the rate can be positive or negative and may contain decimals. Another issue is that the headers are not always the same, so they are not an identifier I can rely on.

So far I've seen some examples online where this works for names, but I am unable to adapt it for stock tickers, quantities and prices. The code I've tried so far is below:

import re
import nltk
from nltk.corpus import stopwords
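# NOTE: the NLTK calls below need their data packages downloaded once, e.g.
# nltk.download('stopwords'), nltk.download('punkt'),
# nltk.download('averaged_perceptron_tagger'),
# nltk.download('maxent_ne_chunker'), nltk.download('words')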
stop = stopwords.words('english')

string = """

To: "Anna Jones" <anna.jones@mm.com>
From: James B.

Hey,
This week has been crazy. Attached is my report on IBM. Can you give it a quick read and provide some feedback.
Also, make sure you reach out to Claire (claire@xyz.com).
You're the best.
Cheers,
George W.
212-555-1234
"""


def extract_phone_numbers(string):
    r = re.compile(r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})')
    phone_numbers = r.findall(string)
    return [re.sub(r'\D', '', number) for number in phone_numbers]


def extract_email_addresses(string):
    r = re.compile(r'[\w\.-]+@[\w\.-]+')
    return r.findall(string)


def ie_preprocess(document):
    document = ' '.join([i for i in document.split() if i not in stop])
    sentences = nltk.sent_tokenize(document)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences


def extract_names(document):
    names = []
    sentences = ie_preprocess(document)
    for tagged_sentence in sentences:
        for chunk in nltk.ne_chunk(tagged_sentence):
            if type(chunk) == nltk.tree.Tree:
                if chunk.label() == 'PERSON':
                    names.append(' '.join([c[0] for c in chunk]))
    return names


if __name__ == '__main__':
    numbers = extract_phone_numbers(string)
    emails = extract_email_addresses(string)
    names = extract_names(string)

    print(numbers)
    print(emails)
    print(names)

This code does a good job with phone numbers, emails and names, but I am unable to replicate it for the example above and do not really know how to go about it. Any tips would be more than helpful.

You can construct regexes that will match the quantities and rates; a sketch is shown below.
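For example, here is a minimal sketch, assuming each data row has the form "TICKER quantity rate" separated by whitespace, and that quantities may use a "k" suffix or thousands separators (the helper names are just illustrative):

import re

# One data row: ticker, quantity, rate separated by whitespace.
ROW_RE = re.compile(
    r'^\s*(?P<symbol>[A-Z]{1,5})\s+'       # ticker: 1-5 uppercase letters
    r'(?P<quantity>[\d.,]+k?)\s+'          # quantity: digits, commas/dots, optional 'k'
    r'(?P<rate>-?\d+(?:\.\d+)?)\s*$'       # rate: optional sign and decimals
)


def parse_quantity(text):
    """Normalize '8.3k' -> 8300.0, '3,400' -> 3400.0, '16' -> 16.0."""
    text = text.lower().replace(',', '')
    multiplier = 1000 if text.endswith('k') else 1
    return float(text.rstrip('k')) * multiplier


def parse_rows(email_body):
    """Extract (symbol, quantity, rate) tuples from the lines that match."""
    rows = []
    for line in email_body.splitlines():
        match = ROW_RE.match(line)
        if match:
            rows.append((match.group('symbol'),
                         parse_quantity(match.group('quantity')),
                         float(match.group('rate'))))
    return rows


sample = """Symbol  Quantity    Rate
AAPL    16        104
MSFT    8.3k      56.24
GS      34        103.1
RM      3,400     -10
APRN    6k        11
NP      14,000    -44"""

print(parse_rows(sample))
# [('AAPL', 16.0, 104.0), ('MSFT', 8300.0, 56.24), ('GS', 34.0, 103.1),
#  ('RM', 3400.0, -10.0), ('APRN', 6000.0, 11.0), ('NP', 14000.0, -44.0)]

Since the header row does not match the numeric patterns, it is skipped automatically; if the column order changes day to day, you could match the three field patterns independently within each line rather than relying on their positions.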

For the tickers, however, you will have to do something different. I suspect the stock symbols are not always written in uppercase letters in the email. If they are, then just write a script that uses an API from one of the stock exchanges and check only the words that are entirely uppercase. But if the symbols are not always uppercase in the emails, you can do several things. You can check every word in the email against that stock exchange to see whether it is a ticker. If you want to speed up that process, you can try dependency parsing (or POS tagging) and send only the nouns and proper nouns to the API. A rough sketch of the uppercase-word approach is below.
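A minimal sketch of the uppercase-word filter, assuming a small local set of known symbols stands in for the exchange API (KNOWN_TICKERS below is purely illustrative):

import re

# Illustrative stand-in for a real symbol lookup (e.g. an exchange or
# market-data API); in practice you would query or cache the full list.
KNOWN_TICKERS = {'AAPL', 'MSFT', 'GS', 'RM', 'APRN', 'NP', 'IBM'}


def candidate_tickers(text):
    """Return every word of 1-5 uppercase letters as a ticker candidate."""
    return set(re.findall(r'\b[A-Z]{1,5}\b', text))


def validated_tickers(text, known=KNOWN_TICKERS):
    """Keep only the candidates confirmed by the symbol reference list."""
    return sorted(candidate_tickers(text) & known)


email_body = "Attached is my report on IBM. AAPL and MSFT both moved today."
print(validated_tickers(email_body))  # ['AAPL', 'IBM', 'MSFT']

The candidate filter keeps the number of lookups small, so the validation step stays cheap even if you replace the local set with a real API call.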
