Extract embedded file from PDF with Python
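See the image.
I have been trying to extract an embedded CSV file from a PDF document using PyPDF2, but I just don't understand PDF and can't seem to get a useful error response.
I have tried the stream methods, the outline methods, caching... nothing.
How can I extract the CSV file?
Thanks!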
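Try copying the data (the CSV data) out of the PDF by hand, pasting it into Notepad, saving it with a ".csv" extension, and then reading the file with pandas.read_csv. Try this and let me know whether it works!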
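The manual route above can be sketched roughly as follows. This is only an illustration: the pasted text, the file name, and the `save_and_read` helper are made up for the example, and the standard-library `csv` module stands in for pandas so that the snippet runs without extra dependencies (with pandas installed, `pandas.read_csv(path)` would be the call the answer refers to).

```python
import csv
import os
import tempfile

# Hypothetical text copied by hand out of the PDF viewer.
pasted_text = "name,score\nalice,90\nbob,85\n"


def save_and_read(text, path):
    """Save pasted text as a .csv file, then read it back as a list of dicts."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        f.write(text)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical location for the saved file.
path = os.path.join(tempfile.mkdtemp(), "pasted.csv")
records = save_and_read(pasted_text, path)
print(records)
```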
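# PyPDF2 provides the functions for working with PDF documents.
# Note: this uses the PyPDF2 1.x/2.x names; PyPDF2 3.x renamed them to
# PdfReader, get_object() and get_data().
import PyPDF2 as pf
# Step 1: read the PDF into a variable
pdf = pf.PdfFileReader('*your file location*')
# Step 2: walk the PDF object tree down to the embedded-files name array
catalog = pdf.trailer['/Root']
fDetail = catalog['/Names']['/EmbeddedFiles']['/Names']
# fDetail alternates [name, file spec, ...], so index 1 is the first file's spec
soup = fDetail[1].getObject()
# Step 3: read the embedded stream's bytes into a variable for further use
file = soup['/EF']['/F'].getData()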
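If the PDF holds more than one attachment, or the bytes need to become CSV rows, something along these lines may help. It is a sketch under assumptions, not part of the original answer: `pair_names` and `csv_rows` are hypothetical helpers, the UTF-8 encoding is a guess, and the `names` list is plain stand-in data. With a real PDF, each file spec would first need `.getObject()` and the stream `.getData()`, as in the snippet above.

```python
import csv
import io


def pair_names(names):
    """Turn a flat [name1, spec1, name2, spec2, ...] array into (name, spec) pairs."""
    return list(zip(names[0::2], names[1::2]))


def csv_rows(data, encoding="utf-8"):
    """Decode the raw bytes of an embedded CSV file and parse them into rows."""
    text = data.decode(encoding)  # encoding is an assumption; adjust as needed
    return list(csv.reader(io.StringIO(text)))


# Stand-in for the '/Names' array: plain Python objects, not PyPDF2 objects.
names = ["data.csv", {"/EF": {"/F": b"a,b\r\n1,2\r\n"}}]

for name, spec in pair_names(names):
    rows = csv_rows(spec["/EF"]["/F"])
    print(name, rows)
```

Pairing the array this way also shows why the answer indexes `fDetail[1]`: the even positions hold file names and the odd positions hold the matching file specifications.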
Further information can be found in these two resources: https://pythonhosted.org/PyPDF2/ and https://fossies.org/dox/openslides-2.3-portable/classPyPDF2_1_1generic_1_1EncodedStreamObject.html