How to extract fields from a PDF in Python using pdfminer

Question

I have a PDF form from which I need to extract the email ID, the person's name, and other information such as skills and city. How can I do that using pdfminer3? Please find a sample PDF attached.

Answer 1

First, use tika to convert the PDF to text.

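# Install the package first: pip install tika
# (in a Jupyter notebook: import sys, then run !{sys.executable} -m pip install tika)
# tika talks to a Java-based Tika server, so a Java runtime must be available.
import re  # used for the regex step that follows

from tika import parser

file = 'filename with directory'  # path to the PDF
parsedPDF = parser.from_file(file)  # parse the file; returns a dict
text = parsedPDF['content']  # extracted text content (can be None if no text was found)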

Next, extract the desired fields from text with regular expressions (Python's re module). Extensive regex tutorials are available online. If you have trouble implementing this, please ask here.
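As a rough sketch of that regex step: the sample PDF is not available here, so the code below assumes the extracted text has one "Label: value" pair per line. The labels (Name, City, Skills), the sample text, and the helper names labeled_value and extract_fields are all hypothetical; adjust the patterns to whatever your PDF's text actually looks like.

```python
import re

# Hypothetical stand-in for the `text` returned by tika; the real layout
# of the PDF is unknown, so this "Label: value" format is an assumption.
SAMPLE_TEXT = """
Name: Jane Doe
Email: jane.doe@example.com
City: Pune
Skills: Python, SQL, Machine Learning
"""

# Simple email pattern; good enough for typical addresses, not a full RFC 5322 match.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def labeled_value(text, label):
    """Return the text after 'label:' on the same line, or None if absent."""
    pattern = rf"^\s*{re.escape(label)}\s*:\s*(.+?)\s*$"
    match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    return match.group(1) if match else None


def extract_fields(text):
    """Collect the fields of interest into a dict (hypothetical helper)."""
    text = text or ""  # the extracted content can be None
    email = EMAIL_RE.search(text)
    skills = labeled_value(text, "Skills")
    return {
        "name": labeled_value(text, "Name"),
        "email": email.group(0) if email else None,
        "city": labeled_value(text, "City"),
        "skills": [s.strip() for s in skills.split(",")] if skills else [],
    }


if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

In practice you would call extract_fields(text) on the string obtained from tika. Print the raw text first to see how the labels and values are laid out, since form PDFs often put a value on the line after its label rather than on the same line.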

Answer 2

Try the tika package:

from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])

2 answers

solution1
1 2019-11-15 08:21:07

solution2
0 2019-11-15 08:06:03