
Document classification using machine learning

Original 2020-02-13 11:55:57 5 2 python/ machine-learning/ nlp/ computer-vision

I am currently working on a project where I need to classify incoming documents dynamically. The documents can arrive as text-based PDF files or as scanned PDF files.

I have the following labels:

Invoice
Packing list
Certificate

I am trying to figure out how I should approach this problem.

My initial thoughts

I was thinking the best way to solve this would be to perform text classification based on the document text.

Step 1 - Train a model

First convert the PDF files to text.
Then label the text content with one of the three labels. (Do this for a large dataset.)
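A minimal sketch of the conversion step, assuming the third-party packages pypdf, pdf2image, and pytesseract (none of which are named in this post) together with the Poppler and Tesseract binaries they rely on. The helper names and the 50-character threshold are hypothetical. Text PDFs are read directly; scanned PDFs, which have no usable text layer, fall back to OCR.

```python
def extract_text_layer(pdf_path):
    """Read the embedded text layer of a PDF (assumes the pypdf package)."""
    from pypdf import PdfReader  # third-party, imported lazily

    reader = PdfReader(pdf_path)
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def ocr_pdf(pdf_path, dpi=300):
    """Render each page to an image and OCR it (assumes pdf2image + pytesseract)."""
    from pdf2image import convert_from_path  # needs Poppler installed
    import pytesseract  # needs the Tesseract binary installed

    pages = convert_from_path(pdf_path, dpi=dpi)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def has_usable_text(text, min_chars=50):
    """Treat a PDF as text-based if it yields enough non-whitespace characters.

    The threshold of 50 is an arbitrary assumption; tune it on real documents.
    """
    return len("".join(text.split())) >= min_chars


def pdf_to_text(pdf_path, extract=extract_text_layer, ocr=ocr_pdf):
    """Return the text layer when present, otherwise fall back to OCR."""
    text = extract(pdf_path)
    return text if has_usable_text(text) else ocr(pdf_path)
```

The extract and ocr parameters are injectable so the routing logic can be exercised without any PDF tooling installed.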

Step 2 - Use the model

Once the model is trained, convert each new incoming document to text.
Run the text content through the model to get the text classification.
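Steps 1 and 2 can be sketched with a small bag-of-words Naive Bayes classifier written against the standard library only. This is an illustrative stand-in, not a recommendation of a specific model; the class name, the tokenize helper, and the training snippets are made up for the example, and a real dataset would hold text extracted from many labelled PDFs.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the smoothed log likelihood of every token.
            score = math.log(n_docs / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical snippets standing in for text extracted from labelled PDFs.
train_texts = [
    "Invoice number 1001 total amount due payment terms net 30",
    "Invoice date bill to subtotal tax total amount due",
    "Packing list shipment carton quantity gross weight net weight",
    "Packing list items packed boxes pallet weight dimensions",
    "Certificate of origin we hereby certify that the goods",
    "Certificate of conformity this is to certify compliance",
]
train_labels = [
    "Invoice", "Invoice",
    "Packing list", "Packing list",
    "Certificate", "Certificate",
]

# Step 1: train. Step 2: classify the text of a new document.
model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
print(model.predict("Total amount due on this invoice"))  # Invoice
```

Six toy snippets only show the mechanics; whether whole-document text classification is accurate enough here still has to be measured on real data.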

Is there another way to do this? My concern is that I am not sure whether NLP can be performed on entire text documents. Maybe object detection (computer vision) is needed instead?

2 answers

Computer vision would be faster and would be my first choice for your use case. Do the three document types differ visually in layout? Certificates probably have a distinct look and layout, but packing lists and invoices may look similar. Convert each PDF into page images, then train and run an image classification model on them first. Use transfer learning on a pre-trained image classification model such as ResNet.
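A rough sketch of the transfer-learning setup, assuming PyTorch and torchvision 0.13 or newer (the answer names ResNet but no framework, so the library choice, the resnet18 variant, and the helper names are assumptions). The pre-trained backbone is frozen and only a new three-way output layer is trained.

```python
LABELS = ["Invoice", "Packing list", "Certificate"]


def build_resnet_classifier(num_classes=len(LABELS)):
    """Return an ImageNet-pretrained ResNet-18 with a fresh output layer.

    Assumes torch and torchvision are installed; the pre-trained weights
    are downloaded on first use.
    """
    import torch.nn as nn
    from torchvision import models

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    # Freeze the backbone so only the new head is updated during training.
    for param in model.parameters():
        param.requires_grad = False
    # Replace the 1000-class ImageNet head with one output per document type.
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model


def label_from_scores(scores, labels=LABELS):
    """Map one page's output scores (one per class) to the best label."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return labels[best_index]
```

Training would then pass only model.fc.parameters() to the optimizer, and a multi-page PDF needs some rule, such as classifying the first page only, for turning page predictions into one document label.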

You can perform NLP on entire documents, but it works best on prose rather than on the fragmentary text found on invoices or packing lists. Look into sentence embedding models (InferSent, Google USE, BERT), which can be used to classify full-page text and not just single sentences, although some of them are computationally expensive.
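One way to classify on top of such embeddings is a nearest-centroid rule: average the embeddings of each label's training pages and assign a new page to the most similar average. The sketch below assumes the sentence-transformers package and its all-MiniLM-L6-v2 model as a stand-in for the models named above; that choice and all function names are assumptions, and the embedding function is a parameter so any other model can be swapped in.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_centroids(embed, texts, labels):
    """Embed the training texts and average the vectors per label."""
    by_label = {}
    for vector, label in zip(embed(texts), labels):
        by_label.setdefault(label, []).append(list(vector))
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def classify(embed, centroids, text):
    """Return the label whose centroid is most similar to the text."""
    vector = list(embed([text])[0])
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


def sentence_transformer_embed(texts, model_name="all-MiniLM-L6-v2"):
    """Embed texts with sentence-transformers (assumed to be installed).

    The model is reloaded on every call for brevity; cache it in real use.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode(texts)
```

Usage would look like centroids = fit_centroids(sentence_transformer_embed, train_texts, train_labels) followed by classify(sentence_transformer_embed, centroids, new_text), where train_texts and train_labels are your own labelled pages.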

I understand your problem. Some key points:

a) Pre-process the input data first, e.g. determine how many pages each invoice or certificate PDF has. Then convert the PDF into TIFF images.

b) Train a model on the image, the visual layout, and the text together. You should get good accuracy.

c) You can use computer vision and deep learning (Keras and TensorFlow).
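Point (a) could be sketched as follows, assuming the pdf2image package (which needs Poppler) and Pillow for writing the files; the function names, the output naming scheme, and the 200 dpi default are hypothetical. The number of returned paths doubles as the page count.

```python
from pathlib import Path


def tiff_paths(pdf_path, page_count, out_dir):
    """Build one output path per page, e.g. out/inv_page001.tiff."""
    stem = Path(pdf_path).stem
    return [
        Path(out_dir) / f"{stem}_page{number:03d}.tiff"
        for number in range(1, page_count + 1)
    ]


def pdf_to_tiffs(pdf_path, out_dir, dpi=200):
    """Render every page of a PDF to its own TIFF file and return the paths."""
    from pdf2image import convert_from_path  # third-party, needs Poppler

    pages = convert_from_path(str(pdf_path), dpi=dpi)  # list of PIL images
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    targets = tiff_paths(pdf_path, len(pages), out_dir)
    for image, target in zip(pages, targets):
        image.save(target, format="TIFF")
    return targets
```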


The technical post webpages of this site follow the CC BY-SA 4.0 license. If you need to reprint, please indicate the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1
1 2020-02-14 19:06:42

solution2
0 2021-07-09 06:58:57
