First, use tika to convert the PDF to text.
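import re
import sys

# In a Jupyter notebook, this installs tika into the running kernel's environment.
# From a regular shell, run "pip install tika" instead.
!{sys.executable} -m pip install tika

from tika import parser

file = 'filename with directory'  # path to the PDF
parsedPDF = parser.from_file(file)  # parse the file; returns a dict
text = parsedPDF['content']  # extracted text (can be None if no text was found)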
Now extract the fields you need from text using regular expressions (the re module imported above). Extensive regex tutorials are available online. If you run into problems implementing this, please ask here.
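As a rough illustration of the regex step, here is a minimal sketch. The sample text, the field names (invoice_no, date, total), and the patterns are all hypothetical; they stand in for whatever your PDF actually contains, so adjust them to your document's layout.

```python
import re

# Hypothetical sample standing in for the text returned by tika.
text = """
Invoice No: INV-2041
Date: 2021-03-15
Total: 1,234.50 USD
"""

# Hypothetical field patterns; each has one capture group for the value.
FIELD_PATTERNS = {
    "invoice_no": re.compile(r"Invoice No:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return {field: first captured value, or None if there is no match}."""
    fields = {}
    for name, pattern in patterns.items():
        # "text or ''" guards against tika returning None for the content.
        match = pattern.search(text or "")
        fields[name] = match.group(1) if match else None
    return fields


fields = extract_fields(text)
print(fields)
```

Keeping the patterns in a dictionary makes it easy to add or change a field without touching the extraction loop.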
Try the tika package:
from tika import parser
raw = parser.from_file('sample.pdf')
print(raw['content'])
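The extracted content often contains many blank lines, and it can be None when the PDF has no text layer (for example, a scanned document). A minimal sketch of a wrapper that handles both cases follows; the helper names clean_content and pdf_to_text are hypothetical, not part of tika.

```python
def clean_content(parsed):
    """Return the 'content' of a tika result with blank lines removed.

    `parsed` is the dict returned by parser.from_file; a missing or None
    'content' value yields an empty string.
    """
    content = parsed.get("content") or ""
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


def pdf_to_text(path):
    """Parse the PDF at `path` with tika and return its cleaned text."""
    # Imported here so clean_content can be used without tika installed.
    from tika import parser
    return clean_content(parser.from_file(path))


# Example with a stand-in dict instead of a real parse result:
print(clean_content({"content": "\n\n  Hello \n\nWorld\n"}))
```

With a real file, pdf_to_text('sample.pdf') would replace the two lines in the answer above.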