
How to extract fields from a PDF in Python using pdfminer

I have a PDF form from which I need to extract the email ID, the person's name, and other information such as skills and city. How can I do that using pdfminer3? Please find a sample of the PDF attached.

First, use tika to convert the PDF to text.

# Install the package first: pip install tika
# (in a Jupyter notebook: import sys, then !{sys.executable} -m pip install tika)
import re  # used for the regex extraction in the next step
from tika import parser

file = 'filename with directory'  # path to the PDF file
parsedPDF = parser.from_file(file)  # parse the file; returns a dict
text = parsedPDF['content']  # get the file's text content

Now extract the desired fields from the text using regular expressions. Extensive regex tutorials are available online. If you have any problem implementing this, please ask here.
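As a rough illustration of the regex step, here is a minimal sketch. The sample PDF is not available here, so the "Label: value" layout, the label names (Name, City, Skills), the sample text, and the helper name extract_fields are all assumptions for illustration; adjust the patterns to whatever tika actually returns for your form.

```python
import re

# Hypothetical text standing in for the `text` extracted by tika.
# The "Label: value" layout is an assumption about the form.
sample_text = """
Name: Jane Doe
Email: jane.doe@example.com
City: Pune
Skills: Python, SQL, Machine Learning
"""

# A deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_fields(text, labels=("Name", "City", "Skills")):
    """Return a dict of fields found in text; fields not found map to None."""
    fields = {}
    match = EMAIL_RE.search(text)
    fields["Email"] = match.group(0) if match else None
    for label in labels:
        # Match "Label: value" on a single line and capture the value.
        pattern = rf"^[ \t]*{re.escape(label)}[ \t]*:[ \t]*(.+)$"
        m = re.search(pattern, text, flags=re.MULTILINE)
        fields[label] = m.group(1).strip() if m else None
    return fields


print(extract_fields(sample_text))
```

The email is searched for anywhere in the text because an address is recognizable by its shape, while the other fields rely on their labels. If your form puts a value on the line after its label, the label pattern will need to change accordingly.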

Try using the tika package:

from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])
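The printed content usually contains many blank lines, and as far as I know 'content' can be None when no text is found (for example, a scanned PDF without a text layer). A small sketch of tidying it up before further processing; the helper name clean_content and the sample string are hypothetical:

```python
def clean_content(content):
    """Return the non-empty lines of content, stripped of surrounding whitespace.

    Returns an empty list when content is None or empty.
    """
    if not content:
        return []
    return [line.strip() for line in content.splitlines() if line.strip()]


# Hypothetical stand-in for raw['content'].
sample = "\n\n\nName: Jane Doe\n\n  Email: jane.doe@example.com  \n\n"
for line in clean_content(sample):
    print(line)
```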
