
Without using any additional libraries, how would someone approach the challenge of reading the metadata of .pdf files in Python?

I know this is not an easy question and I do not expect an easy answer. I want to learn more about this, and the only way to do it is the hard way.

What first steps should I take?

If you want entries such as 'CreationDate' and 'Author', you can try this quick-and-dirty solution. In a PDF, this information normally looks something like this:

obj
<<
/Author(NameOfAuthor)
/CreationDate(D:20040910110429)
/Producer(AcrobatPdfWriter)
>>
endobj

I'm not sure this applies to every PDF, but it gave me decent data that you can clean up afterwards. It only works if the entries are on separate lines.

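metadata_fields = ['Creator', 'CreationDate', 'Producer', 'ModDate']
# PDFs are binary files, so open in 'rb' mode and decode each line as Latin-1,
# which maps every byte to a character and therefore never fails.
with open('path_to_your_file.pdf', 'rb') as my_pdf:
    meta_values = [line.decode('latin-1').rstrip('\r\n') for line in my_pdf
                   if any(field.encode() in line for field in metadata_fields)]
print(meta_values)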

Output:

['<</Producer(AFPL Ghostscript 8.11)', '/CreationDate(D:20040910110429)',
 '/ModDate(D:20040910110429)', '/Creator(PDFCreator Version 0.8.0)']
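
If the entries are not on separate lines, a regular expression over the raw bytes may get a little further. The following is only a sketch under several assumptions: the values are plain literal strings in parentheses with no nested or escaped parentheses, the dictionary is not stored in a compressed object stream, the file is not encrypted, and the dates carry the full 14 digits. The names `ENTRY_RE`, `extract_info` and `parse_pdf_date` are made up for this example.

```python
import re
from datetime import datetime

# Matches entries such as /Author(Jane Doe). Assumes literal strings with no
# nested or escaped parentheses (hex strings like <FEFF...> are not handled).
ENTRY_RE = re.compile(
    rb'/(Author|Creator|Producer|CreationDate|ModDate)\s*\(([^()]*)\)'
)

def extract_info(data):
    """Return a dict of metadata entries found anywhere in the raw PDF bytes."""
    info = {}
    for key, value in ENTRY_RE.findall(data):
        # Keep only the first occurrence of each key.
        info.setdefault(key.decode('ascii'), value.decode('latin-1'))
    return info

def parse_pdf_date(text):
    """Turn 'D:YYYYMMDDHHmmSS...' into a naive datetime, ignoring any timezone suffix."""
    digits = text[2:16] if text.startswith('D:') else text[:14]
    return datetime.strptime(digits, '%Y%m%d%H%M%S')

# Sample bytes modelled on the output above, with two entries on one line.
sample = b'<</Producer(AFPL Ghostscript 8.11)/CreationDate(D:20040910110429)>>'
info = extract_info(sample)
print(info)
print(parse_pdf_date(info['CreationDate']))
```

For a real file you would pass `open('path_to_your_file.pdf', 'rb').read()` to `extract_info`. Going beyond this (following the trailer's /Info reference, decompressing streams) means implementing more of the PDF format by hand.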
