简体   繁体   中英

Finding the right word and row in the Financial Statement text file

I've used tesseract OCR in Python to convert the Financial statement pdfs to text files, while converting the long whitespaces into ";". So the text file looks pretty nice and the tables are looking good.

Using an example found here https://cdn.corporatefinanceinstitute.com/assets/AMZN-Cash-Flow.png

The table would be like the following:
Stock-based compensation;2,119;2,975;4,215
Other operating expense, net;155;160;202
Other expense (income), net;250;(20);(292)
Deferred income taxes;81;(246);(29)
...

OK, so the Task is to find the first sum after eg "Stock-based compensation" -> 2,119. I've faced at least 3 issues with this:

1st problem is that I always have the whole financial statement pdf to start with, which contains for example 20 pages and can contain the word "Stock-based compensation" multiple times in sentences like " ..is the date on which a stock-based compensation is granted... ".

2nd problem is that to find the right table in the financial statement. There can be various smaller tables where "Stock-based compensation" can occur. However, in this case lets say we are looking for table called "Consolidated statements of cash flow", not for example "Estimated budget for next fiscal year" etc.

3rd problem is the word itself. "Stock-based compensation" can vary in different forms, like "Stock based compensation", "Compensation", "Stock based compensation & other compensation" etc. However, as we now that this "Stock based compensation" in a form or another shall be anyway in the right table, it shouldn't be a major issue to find the right line.

I've used for example regex to narrow down the options for the right word I'm looking for like this

def find_sum(word_to_look_for):
txt_file = r"fina.txt"
find = pattern
digitals = "\d+|;" #trying to find if any digits or ";" can be found on the row
with open(txt_file, "r") as in_text:
    for row in in_text:
        if re.findall(find, row, flags=re.IGNORECASE):
            if re.findall(digitals, row):
                expense_row = row.split(";")[1].strip("-")
                expenses = re.sub("[^\d\,]", "", expense_row) #if e.g 2.512,00
                return expenses
            else:
                pass

This solves some issues, but I'm currently thinking about whether I should implement ML or NLP tech into this case, or would this be easy enough to solve with regex, just narrowing down the possible lines with n-amount of if-statements?

Use words to look for with whitespace characters, regular expression can be create in the function with .* joiners to allow matching anything between those words.

def find_sum(word_to_look_for):
    txt_file = r"fina.txt"
    find = ".*".join([re.escape(w) for w in word_to_look_for.split()]) + r"\D*(\d+(?:[,.]\d+)*)"
    with open(txt_file, "r") as in_text:
        for row in in_text:
            match = re.search(find, row, flags=re.IGNORECASE)
            if match:
                return match.group(1)
    return ""

Python code :

import re

def find_sum(word_to_look_for):
    txt_file = r"fina.txt"
    find = ".*".join([re.escape(w) for w in word_to_look_for.split()]) + r"\D*(\d+(?:[,.]\d+)*)"
    in_text = ['...', 'XXX', 'Stock-based compensation;2,234.55']
    #with open(txt_file, "r") as in_text:
    for row in in_text:
        match = re.search(find, row, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return ""

wtlf = "Stock based compensation"
print(find_sum(wtlf))

Results : 2,234.55

Regex :

Stock.*based.*compensation\D*(\d+(?:[,.]\d+)*)

EXPLANATION

--------------------------------------------------------------------------------
  Stock                    'Stock'
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  based                    'based'
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  compensation             'compensation'
--------------------------------------------------------------------------------
  \D*                      non-digits (all but 0-9) (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    \d+                      digits (0-9) (1 or more times (matching
                             the most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture (0 or more
                             times (matching the most amount
                             possible)):
--------------------------------------------------------------------------------
      [,.]                     any character of: ',', '.'
--------------------------------------------------------------------------------
      \d+                      digits (0-9) (1 or more times
                               (matching the most amount possible))
--------------------------------------------------------------------------------
    )*                       end of grouping
--------------------------------------------------------------------------------
  )                        end of \1

Thanks for the answers, this

find = ".*".join([re.escape(w) for w in word_to_look_for.split()]) + r"\D*(\d+(?:[,.]\d+)*)"

worked well. However, I still got an issue as in my FinaStatements there are two almost identical tables, one for budgeting and one to show the numbers from last year. Both of those tables are marked with clear heading in top of the page. With this answer above, I get both of those matches, as the row itself looks similar, but I only want the other match.

I added a text indicating the start of each page within the OCR process, and then look for right page including the right table, but seems to be harder than thought. Any ideas for this?

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM