Merging PDFs using reportlab and PyPDF2 loses images and embedded fonts

I am trying to take an existing PDF stored on AWS, read it into my backend (Django 1.1, Python 2.7) and add text into the margin. My current code successfully takes in the PDF and adds text to the margin, but it corrupts the PDF:

When opening in the browser:

  1. Removes pictures
  2. Occasionally adds characters between words
  3. Occasionally completely changes the character set of the PDF

When opening in Adobe:

  1. Says "Cannot extract the embedded font 'whatever font name'. Some characters may not display or print correctly"
  2. Says "A drawing error occurred"
  3. If there were pictures pre-edit, says "Insufficient data for an image"

I made my own test PDFs, with and without predefined fonts and with and without images. The ones with predefined fonts and no images work as expected. With images, Adobe throws "There was an error while reading a stream." when opening the file, and the browser simply doesn't show the images. I have concluded that missing fonts are the reason for the character problems, but I'm not sure why the images aren't showing.

I don't control the contents of the PDFs I'm editing, so I can't ensure they only use the predefined fonts, and they will definitely need to contain images. My code is below:

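# Imports the snippet relies on (the Django/DRF ones are inferred from usage)
from datetime import datetime
from StringIO import StringIO  # Python 2 only

from django.core.files.storage import default_storage
from django.http import Http404, HttpResponse
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from rest_framework import permissions
from rest_framework.views import APIView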

class DownloadMIR(APIView):
    permission_classes = (permissions.IsAuthenticated,)

    def post(self, request, format=None):
        data = request.data

        file_path = "some_path"
        temp_file_path = "some_other_path"

        # read your existing PDF

        if default_storage.exists(file_path):
            existing_pdf = PdfFileReader(default_storage.open(file_path, 'rb'))
        else:
            raise Http404("could not find pdf")

        packet = StringIO()
        # create a new PDF with Reportlab
        can = canvas.Canvas(packet)
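        # Note: the names are swapped relative to the page. "height" holds the
        # page's x extent and "width" its y extent; the canvas is rotated below.
        height, width = int(existing_pdf.getPage(0).mediaBox.getUpperRight_x()), int(
            existing_pdf.getPage(0).mediaBox.getUpperRight_y())
        print("width:" + str(width) + " height: " + str(height))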
        can.setPageSize([width, height])
        can.rotate(90)
        footer = "Prepared for " + request.user.first_name + " " + request.user.last_name + " on " + datetime.now().strftime('%Y-%m-%d at %H:%M:%S')
        can.setFont("Courier", 8)
        can.drawCentredString(width / 2, -15, footer)
        can.save()

        packet.seek(0)
        new_pdf = PdfFileReader(packet)

        output = PdfFileWriter()
        for index in range(existing_pdf.numPages):
            page = existing_pdf.getPage(index)
            page.mergePage(new_pdf.getPage(0))
            output.addPage(page)
            #print("done page " + str(index))

        response = HttpResponse(content_type="application/pdf")

        response['Content-Disposition'] = 'attachment; filename=' + temp_file_path

        output.write(response)
        return response
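
To make the footer placement easier to reason about, here is a small sketch of the coordinate arithmetic behind `can.rotate(90)`. It assumes that reportlab's `rotate` turns the coordinate system counterclockwise by the given angle, and it uses a hypothetical US Letter page (612 x 792 points). The helper name `rotated_to_page` is invented for illustration.

```python
import math


def rotated_to_page(x, y, degrees):
    """Map a point given in the rotated coordinate system back to page coordinates.

    Assumes a counterclockwise rotation about the origin, which is how
    canvas.rotate() is understood to behave.
    """
    theta = math.radians(degrees)
    page_x = x * math.cos(theta) - y * math.sin(theta)
    page_y = x * math.sin(theta) + y * math.cos(theta)
    return (round(page_x, 6), round(page_y, 6))


# Hypothetical portrait page: 612 points wide, 792 points tall.
page_width, page_height = 612, 792

# The question's code stores the page height in a variable named "width",
# then draws the footer at (width / 2, -15) after rotating by 90 degrees.
anchor = rotated_to_page(page_height / 2.0, -15, 90)
print(anchor)  # (15.0, 396.0)
```

Under these assumptions the footer is centered 15 points in from the left edge, halfway up the page, and reads from bottom to top.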

Using a script I found online, I see that there are unembedded fonts.

Font List
['/MPDFAA+DejaVuSansCondensed', '/MPDFAA+DejaVuSansCondensed-Bold', '/MPDFAA+DejaVuSansCondensed-BoldOblique', '/MPDFAA+DejaVuSansCondensed-Oblique', '/ZapfDingbats']

Unembedded Fonts
set(['/MPDFAA+DejaVuSansCondensed-Bold', '/ZapfDingbats', '/MPDFAA+DejaVuSansCondensed-BoldOblique', '/MPDFAA+DejaVuSansCondensed', '/MPDFAA+DejaVuSansCondensed-Oblique'])
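
One thing worth checking in output like this: a six-letter prefix followed by `+` (here `MPDFAA+`) is the PDF convention for a subset font, which normally means a font program is embedded. A font-listing script that only recurses into dictionaries can miss descriptors stored inside arrays such as `/DescendantFonts`, and would then report those fonts as unembedded. The sketch below walks both dictionaries and lists. It operates on plain nested `dict`/`list` data, and the names `walk_fonts` and `resolve` are invented here. Whether the script used in the question has this limitation is an assumption, not something the post confirms.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def resolve(obj):
    """Follow an indirect reference if the object offers a way to do so.

    PyPDF2 objects expose getObject() (older releases) or get_object()
    (newer ones); plain dicts and lists are returned unchanged.
    """
    for name in ("get_object", "getObject"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj


def walk_fonts(obj, fonts=None, embedded=None, seen=None):
    """Collect /BaseFont names and the /FontName of descriptors with a font file."""
    fonts = set() if fonts is None else fonts
    embedded = set() if embedded is None else embedded
    seen = set() if seen is None else seen

    obj = resolve(obj)
    if isinstance(obj, (dict, list, tuple)):
        if id(obj) in seen:  # guard against reference cycles such as /Parent
            return fonts, embedded
        seen.add(id(obj))

    if isinstance(obj, dict):
        if "/BaseFont" in obj:
            fonts.add(str(obj["/BaseFont"]))
        if "/FontName" in obj and any(key in obj for key in FONT_FILE_KEYS):
            embedded.add(str(obj["/FontName"]))
        for value in obj.values():
            walk_fonts(value, fonts, embedded, seen)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            walk_fonts(item, fonts, embedded, seen)
    return fonts, embedded


# Hypothetical Type0 font whose descriptor sits inside a /DescendantFonts array.
resources = {
    "/Font": {
        "/F1": {
            "/Subtype": "/Type0",
            "/BaseFont": "/MPDFAA+DejaVuSansCondensed",
            "/DescendantFonts": [
                {
                    "/FontDescriptor": {
                        "/FontName": "/MPDFAA+DejaVuSansCondensed",
                        "/FontFile2": b"placeholder font program bytes",
                    }
                }
            ],
        },
        "/F2": {"/Subtype": "/Type1", "/BaseFont": "/ZapfDingbats"},
    }
}

fonts, embedded = walk_fonts(resources)
print(sorted(fonts - embedded))  # ['/ZapfDingbats']
```

`/ZapfDingbats` is one of the standard 14 PDF fonts that viewers are expected to supply themselves, so seeing it reported as unembedded is normal.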

My questions: is there a way to extract the embedded fonts from the original PDF and embed them in the new PDF? And is there something I'm doing wrong that prevents the images from being embedded?

After some testing, I discovered that the problem was not with the generated PDF but with how it was returned in the response. When I saved the PDF to the bucket and downloaded it with the AWS CLI, it opened correctly. I did not figure out how to fix the response so that the PDF reaches the front end intact.
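
The post leaves the fix open, so the following is only a sketch of how the two byte streams could be compared, not a confirmed solution. It serializes the writer into an in-memory buffer so the exact bytes are available for a checksum. It also shows that pushing binary PDF data through a text decode/encode round trip (for example, a client that reads the response body as a string) damages non-ASCII bytes such as compressed image and font streams while leaving plain ASCII intact. That would be consistent with the symptoms described, but whether it is what happened here is an assumption. `FakeWriter`, `pdf_bytes` and `text_round_trip` are hypothetical names; `FakeWriter` stands in for `PdfFileWriter`, which exposes the same `write(stream)` method.

```python
import hashlib
import io


def pdf_bytes(writer):
    """Serialize anything with a write(stream) method into a bytes object."""
    buffer = io.BytesIO()
    writer.write(buffer)
    return buffer.getvalue()


def text_round_trip(data):
    """Simulate a client that decodes the body as UTF-8 text and re-encodes it."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


class FakeWriter(object):
    """Hypothetical stand-in for PdfFileWriter; writes a tiny PDF-like payload."""

    def __init__(self, payload):
        self.payload = payload

    def write(self, stream):
        stream.write(self.payload)


ascii_only = pdf_bytes(FakeWriter(b"%PDF-1.4\n(Hello)\n%%EOF\n"))
with_binary = pdf_bytes(
    FakeWriter(b"%PDF-1.4\nstream\n\xff\xd8\xff\xe0\nendstream\n%%EOF\n")
)

print(text_round_trip(ascii_only) == ascii_only)    # True
print(text_round_trip(with_binary) == with_binary)  # False

# Comparing checksums of the server-side bytes and the downloaded file
# shows whether the body was altered somewhere in between.
server_digest = hashlib.sha256(with_binary).hexdigest()
client_digest = hashlib.sha256(text_round_trip(with_binary)).hexdigest()
print(server_digest == client_digest)  # False
```

In a Django view, the result of `pdf_bytes(output)` could be passed to `HttpResponse(data, content_type="application/pdf")`, and the front end would then need to read the response as binary rather than text.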
