[英]read multiple files and save to xls in columns (pypdf2 and xlsxwriterr
我需要獲取一個包含多個 PDF 的目錄並將其結構化為一個 xls
但我不明白如何在目錄中制作列表將數據保存在xls中
enter import PyPDF2
import xlsxwriter
#---------------------Input file-----------------------------------#
pdf_file = open('arquivo_file','rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(0)
doc = read_pdf.getOutlines
page_content = page.extractText()
text = page_content.replace("\n", " ").replace("\t", " ").replace(" ", "")
content = page_content.split("\n")
data = content[0]
worksheet.write(1, 1, data)
workbook.close() here
通常,您的代碼將類似於這樣的內容。
import os
import glob
DIRPATH = "/path/to/your/pdf/directory"
# Get list of files with extension .pdf in a given directory
pdf_filepaths = glob.glob(os.path.join(DIRPATH, '*.pdf'))
# Loop over the pdf file-paths
# For each pdf-file:
# 1. read each pdf file
# 2. process the content you read (optional)
# 3. save the processed content to excel file
for i, pdf_filepath in enumerate(pdf_filepaths):
content = read_pdf_file(pdf_filepath)
content = process_data(content)
write_excel_file(filename='out_{i}.xlsx', content=content)
在這里,我假設您將讀取、處理和寫入邏輯包裝在三個函數中:
def read_pdf_file(filepath):
# your pdf reading logic goes here
...
return content
def process_data(content):
# your post-reading data-processing logic goes here
...
return content
def write_excel_file(filepath, content):
# your logic for writing to excel-file goes here
...
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.