Document classification using machine learning

I am currently working on a project where I need to classify incoming documents dynamically. These documents can be text-based PDF files as well as scanned PDF files.

I have the following labels:

  • Invoice
  • Packing list
  • Certificate

I am trying to figure out how I should approach this problem.

My initial thoughts

I think the best way to solve this would be text classification based on the document text.

Step 1 - Train a model

  • First convert the PDF files to text.
  • Then label the text content with one of the three labels (doing this for a large dataset).
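The conversion step has to handle both kinds of PDF mentioned above. A minimal sketch, assuming the third-party packages pypdf, pdf2image and pytesseract are installed (none of them is named in the question), and assuming a made-up threshold of 50 characters for deciding that a page has no usable text layer:

```python
def looks_scanned(extracted_text: str, min_chars: int = 50) -> bool:
    """Heuristic (an assumption): a page with almost no embedded text is a scan."""
    return len(extracted_text.strip()) < min_chars


def pdf_to_text(pdf_path: str) -> str:
    """Return the text of a PDF, running OCR only on pages without a text layer."""
    # Third-party imports are local so looks_scanned() works without them.
    from pypdf import PdfReader              # pip install pypdf
    from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)
    import pytesseract                       # pip install pytesseract (needs tesseract)

    reader = PdfReader(pdf_path)
    pages = []
    for page_number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if looks_scanned(text):
            # Render only this page to an image and OCR it.
            image = convert_from_path(
                pdf_path, dpi=300, first_page=page_number, last_page=page_number
            )[0]
            text = pytesseract.image_to_string(image)
        pages.append(text)
    return "\n".join(pages)
```

The same function can be reused in Step 2, so training and incoming documents go through identical conversion.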

Step 2 - Use the model

  • Once the model is trained, convert each new incoming document to text.
  • Run the text content through the model to get the text classification.
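The two steps above can be sketched with a small bag-of-words Naive Bayes classifier. This is only an illustration under stated assumptions: the training sentences are invented, the class name NaiveBayesClassifier is hypothetical, and in practice a library implementation and a much larger labelled dataset would likely be used instead.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Step 1: train on labelled text (tiny made-up examples).
train_texts = [
    "invoice number total amount due payment terms vat",
    "invoice date bill to amount due bank account",
    "packing list gross weight net weight cartons quantity",
    "packing list package dimensions cartons pallets",
    "certificate of origin we hereby certify that the goods",
    "certificate of conformity certify compliance standard",
]
train_labels = [
    "invoice", "invoice",
    "packing_list", "packing_list",
    "certificate", "certificate",
]
model = NaiveBayesClassifier().fit(train_texts, train_labels)

# Step 2: classify the text of a new document.
print(model.predict("Total amount due: 1,200 EUR, payment within 30 days"))
```

A word-count model like this does not need prose; it only relies on each document type using characteristic vocabulary.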

Is there another way to do this? My concern is that I am not sure whether NLP can be applied to entire text documents. Maybe object detection (computer vision) is needed instead?

Computer vision would be faster and would be my first choice for your use case. Do the three document types look different in terms of layout? Certificates probably have a distinct look and layout, but packing lists and invoices may look similar. You would convert each PDF into page images, then train and run an image classification model on them. Use transfer learning on a pre-trained image classification model such as ResNet.
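As a rough sketch of that transfer-learning setup, assuming TensorFlow/Keras is installed and that page images have already been exported into one sub-folder per label (the folder layout, the helper names and the hyperparameters are all assumptions, not something from the question):

```python
LABELS = ["invoice", "packing_list", "certificate"]
IMAGE_SIZE = (224, 224)  # default input size of the pre-trained ResNet50


def build_page_classifier(num_classes=len(LABELS)):
    """Frozen ImageNet ResNet50 backbone with a new trainable softmax head."""
    import tensorflow as tf  # imported lazily: pip install tensorflow

    backbone = tf.keras.applications.ResNet50(
        weights="imagenet",
        include_top=False,   # drop the original 1000-class head
        pooling="avg",       # one feature vector per page image
        input_shape=IMAGE_SIZE + (3,),
    )
    backbone.trainable = False  # transfer learning: train only the new head

    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


def train(image_dir, epochs=5):
    """Train on image_dir/<label>/*.png, one sub-folder per entry in LABELS."""
    import tensorflow as tf

    preprocess = tf.keras.applications.resnet50.preprocess_input
    dataset = tf.keras.utils.image_dataset_from_directory(
        image_dir, class_names=LABELS, image_size=IMAGE_SIZE, batch_size=32
    )
    # Apply the same input normalisation the backbone was pre-trained with.
    dataset = dataset.map(lambda images, labels: (preprocess(images), labels))

    model = build_page_classifier()
    model.fit(dataset, epochs=epochs)
    return model
```

Only the final dense layer is trained here, which keeps training cheap; whether that is enough to separate invoices from packing lists would have to be checked on real pages.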

You can perform NLP on entire documents, but it works best on prose and less well on the text of invoices or packing lists. You can look up sentence embedding models (InferSent, Google's Universal Sentence Encoder, BERT), which can be used to classify full-page text and not just sentences, although some of them are computationally expensive.

I understand your problem. Some key points:

a) First pre-process the input data, e.g. determine how many pages each invoice or certificate has in the PDF. Then convert the PDF into TIFF images.

b) Train a model using the image, the visual layout, and the text. This should give good accuracy.

c) You can use computer vision and deep learning (Keras and TensorFlow).
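One simple way to combine image and text signals, shown here only as an assumption about what such a combined model could look like, is late fusion: run an image classifier and a text classifier separately and average their per-class probabilities. The function name, the weight and the probability values below are made up for illustration.

```python
LABELS = ("invoice", "packing_list", "certificate")


def fuse_predictions(image_probs, text_probs, image_weight=0.5):
    """Weighted average of two per-class probability dicts (late fusion)."""
    if not 0.0 <= image_weight <= 1.0:
        raise ValueError("image_weight must be between 0 and 1")
    fused = {
        label: image_weight * image_probs[label]
        + (1.0 - image_weight) * text_probs[label]
        for label in LABELS
    }
    best = max(fused, key=fused.get)
    return best, fused


# Made-up scores: the layout model is unsure, the text model favours "invoice".
image_probs = {"invoice": 0.45, "packing_list": 0.50, "certificate": 0.05}
text_probs = {"invoice": 0.80, "packing_list": 0.15, "certificate": 0.05}

label, scores = fuse_predictions(image_probs, text_probs)
print(label, scores)
```

The weight can be tuned on a validation set, for example giving the text model more influence when invoices and packing lists share a similar layout.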
