Without using any additional libraries, how would someone approach the challenge of reading the metadata of .pdf files in Python?

Question

I know this is not an easy question and I do not expect an easy answer. I want to learn more about this, and the only way to do it is the hard way.

What first steps should I take?

Answer 1

If you want entries such as 'CreationDate' and 'Author', you can try this quick and dirty solution. In a PDF file, this information normally looks like this:

obj
<<
/Author(NameOfAuthor)
/CreationDate(D:20040910110429)
/Producer(AcrobatPdfWriter)
>>
endobj
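The CreationDate value above uses the PDF date format: an optional D: prefix, then YYYYMMDDHHmmSS, optionally followed by a time zone offset such as +02'00' or Z. As a minimal sketch using only the standard library, such a string could be turned into a datetime like this. The name parse_pdf_date is made up for illustration, and the function simply returns None for anything it does not recognise.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] with an optional offset such as +02'00' or Z.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a datetime, or return None if it does not match."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    # Missing components fall back to the start of the period (month 1, day 1, 00:00:00).
    defaults = (0, 1, 1, 0, 0, 0)
    parts = [int(part) if part else default
             for part, default in zip(match.groups()[:6], defaults)]
    try:
        tz = None
        if match.group(7):
            tz = timezone.utc
        elif match.group(8):
            offset = timedelta(hours=int(match.group(9)),
                               minutes=int(match.group(10) or 0))
            tz = timezone(-offset if match.group(8) == "-" else offset)
        return datetime(*parts, tzinfo=tz)
    except ValueError:
        # Out-of-range values such as month 13 or an offset of 99 hours.
        return None
```

Calling parse_pdf_date('D:20040910110429') gives a naive datetime for 10 September 2004, 11:04:29, because that string carries no offset.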

I am not sure this applies to all PDF files, but I got some decent data that you can clean up afterwards. The snippet below only works if the entries are on separate lines.

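metadata_fields = ['Creator', 'CreationDate', 'Producer', 'ModDate']
# Open in binary mode: a PDF mixes text with binary streams. Latin-1 maps
# every byte to a character, so decoding never fails.
with open('path_to_your_file.pdf', 'rb') as my_pdf:
    lines = [raw.decode('latin-1').rstrip('\r\n') for raw in my_pdf]
# Keep each line that mentions at least one of the metadata fields.
meta_values = [line for line in lines
               if any(item in line for item in metadata_fields)]
print(meta_values)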

Output:

['<</Producer(AFPL Ghostscript 8.11)', '/CreationDate(D:20040910110429)',
 '/ModDate(D:20040910110429)', '/Creator(PDFCreator Version 0.8.0)']
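The raw lines above still need cleaning up, and the line-based approach misses entries that share a line. A regular expression over the whole file is one way around both problems. The following is only a sketch under loud assumptions: the values are literal strings in parentheses with no nested or escaped parentheses, and the dictionary is stored uncompressed and unencrypted. The names FIELD_PATTERN, extract_info_fields and read_pdf_metadata are invented for this example.

```python
import re

# Assumption: values look like /Key(value) with no nested or escaped
# parentheses, and the dictionary is not inside a compressed stream.
FIELD_PATTERN = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    rb"\s*\(([^()]*)\)"
)


def extract_info_fields(data):
    """Return a dict of Info-style entries found in raw PDF bytes."""
    fields = {}
    for key, value in FIELD_PATTERN.findall(data):
        # Keep the first occurrence of each key (an arbitrary choice here).
        fields.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return fields


def read_pdf_metadata(path):
    """Read a file in binary mode and extract its Info-style entries."""
    with open(path, "rb") as pdf_file:
        return extract_info_fields(pdf_file.read())
```

With this, read_pdf_metadata('path_to_your_file.pdf') returns a dictionary keyed by field name instead of a list of raw lines. Files that store their metadata as hex strings, in XMP packets or in compressed object streams will come back empty, which is where parsing the actual PDF object structure would have to start.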

solution1
1 ACCEPTED 2014-03-07 19:23:22
