
Generating PDF documents from large text files using Python's ReportLab package is slow

I have a very large number of text files that I need to convert to PDFs (using Python 3.8.5), separating the content by page breaks. The page breaks are encoded in these text files as form feeds, which appear in Python as the substring \x0c . I am able to read the text in and split the document on these form feeds. Then I use the package reportlab to create a PDF with the correct pagination. This is a condensed version of my code:
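For illustration, the form-feed split described above is just a plain str.split; a standalone sketch using an in-memory string instead of a file:

```python
# Standalone sketch: splitting a document on form feeds (\x0c),
# producing one list entry per page.
text = "page one\x0cpage two\x0cpage three"
pages = text.split("\x0c")
print(len(pages))  # 3
```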

import glob
from xml.sax.saxutils import escape  # Paragraph parses XML-style markup, so raw text must be escaped
from reportlab.lib.enums import TA_JUSTIFY
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, PageBreak, Spacer
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib.units import inch

files = glob.glob(wdir + text_folder + "/**/*.txt", recursive=True)
for i in files:
    doc = SimpleDocTemplate(i[:-4] + ".pdf", pagesize=letter,
                            rightMargin=72, leftMargin=72, topMargin=72, bottomMargin=18)
    with open(i, encoding='utf-8') as f:
        k = f.read()
    k_breaks = k.split("\x0c")
    Story = []
    styles = getSampleStyleSheet()
    styles.add(ParagraphStyle(name='Justify', alignment=TA_JUSTIFY))
    for j in range(len(k_breaks)):
        ptext = '<font size="12">' + escape(k_breaks[j]) + '</font>'
        Story.append(Paragraph(ptext, styles["Justify"]))
        Story.append(Spacer(1, 12))
        if j != len(k_breaks) - 1:
            Story.append(PageBreak())
    doc.build(Story)

Through tracing, I've found that my code seems to hit a bottleneck on these lines:

          Story.append(Paragraph(ptext, styles["Justify"]))
          Story.append(Spacer(1,12))

This is only an issue with large text files (upwards of 1 or 2 MB), though. Smaller text files in the 100 KB range are not too slow, but the larger files take many hours, and the resulting PDFs run to hundreds or thousands of pages. I want to reduce the processing time. Is there a better way to do this within reportlab, or a suggested change in methodology, perhaps via a different package?
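One thing worth trying before switching libraries: building a single Paragraph from megabytes of text forces ReportLab to wrap one enormous flowable, which is where the time goes. A hedged sketch of a pre-splitting helper (pure Python; split_into_chunks and max_chars are illustrative names, not part of reportlab) that breaks each page's text into smaller pieces on line boundaries, each of which could then become its own short Paragraph:

```python
# Hypothetical helper (not part of reportlab): pre-split one page's text
# into pieces of at most max_chars characters, cutting only on line
# boundaries, so each piece can become its own small Paragraph flowable.
def split_into_chunks(page_text, max_chars=2000):
    chunks, current, size = [], [], 0
    for line in page_text.splitlines(keepends=True):
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:  # note: a single line longer than max_chars stays whole
        chunks.append("".join(current))
    return chunks

print(len(split_into_chunks("a\n" * 100, max_chars=10)))  # 20
```

Joining the chunks back together reproduces the original page text, so no content is lost; only the flowable granularity changes.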

You can check out the pdfme library. It is one of the most powerful Python libraries for creating PDF documents.

I can't tell whether it would be faster with those huge files, but you could give it a try with the following code:

import glob
from pdfme import build_pdf

files = glob.glob(wdir + text_folder + "/**/*.txt", recursive=True)
for i in files:
    with open(i, encoding='utf-8') as f:
        k = f.read()
    k_breaks = k.split("\x0c")
    sections = [{"content": [k_break]} for k_break in k_breaks]
    with open(i[:-4] + ".pdf", 'wb') as f:
        build_pdf({
            "style": {"s": 12, "text_align": "j"},
            "page_style": {"page_size": "letter", "margin": [72, 72, 18, 72]},
            "sections": sections
        }, f)

Check the pdfme docs for details.
