Extracting embedded files from a PDF using Python

Question

CSV inside a PDF

See image.

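I have been trying to extract an embedded CSV file from a PDF document using PyPDF2, but I don't really understand the PDF format and can't seem to get a useful error message.

I have tried the stream methods, the outline methods, caching... nothing worked.

How can I extract the CSV file?

Thanks!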

Answer 1

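Try manually copying the data (the CSV data) out of the PDF, pasting it into Notepad, saving it with a ".csv" extension, and then reading the file with pandas.read_csv. Give it a try and let me know whether it works!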

Answer 2

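# PyPDF2 provides the functions for working with PDF documents.
# Note: this uses the older PyPDF2 names (PdfFileReader, getObject, getData);
# newer PyPDF2 releases rename them to PdfReader, get_object, and get_data.
import PyPDF2 as pf

# Step 1: Read the PDF into a variable
pdf = pf.PdfFileReader('*your file location*')

# Step 2: Traverse the PDF tree structure down to the embedded files.
# '/Names' is a flat list alternating file name and file specification,
# so index 1 is the file specification of the first embedded file.
catalog = pdf.trailer['/Root']
fDetail = catalog['/Names']['/EmbeddedFiles']['/Names']
soup = fDetail[1].getObject()

# Step 3: Read the stream data (raw bytes) into a variable for further use
file = soup['/EF']['/F'].getData()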
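
Step 3 leaves the embedded file as raw bytes. As a rough sketch of the "further use", the bytes can be decoded and parsed with the standard-library csv module. The helper name csv_rows_from_bytes and the UTF-8 default are assumptions made for illustration; the actual encoding depends on how the CSV was produced.

```python
import csv
import io


def csv_rows_from_bytes(data, encoding="utf-8-sig"):
    """Decode raw bytes and parse them as CSV, returning a list of rows.

    The encoding default is an assumption; "utf-8-sig" also strips a
    leading byte-order mark if one is present.
    """
    text = data.decode(encoding)
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself, including line breaks inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))


# Stand-in bytes for illustration only; in practice, pass the `file`
# variable produced in Step 3.
sample = b"name,value\r\nalpha,1\r\nbeta,2\r\n"
rows = csv_rows_from_bytes(sample)
```

Each row comes back as a list of strings, so numeric columns still need converting before any calculation.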

Further information can be found in these two resources:
https://pythonhosted.org/PyPDF2/
https://fossies.org/dox/openslides-2.3-portable/classPyPDF2_1_1generic_1_1EncodedStreamObject.html
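
Step 2 above reads only the entry at index 1, which is the first embedded file. A PDF can hold several attachments, and the embedded-files name tree may also be split into child nodes under '/Kids'. The following is a minimal sketch of walking the whole tree. The helper names _resolve and iter_embedded_files are hypothetical and not part of PyPDF2; the code relies only on dictionary and list access, so it should work on the objects reached via catalog['/Names']['/EmbeddedFiles'], but treat it as an illustration rather than a tested recipe for every PDF.

```python
def _resolve(obj):
    """Dereference an indirect PDF object if it exposes a resolver method.

    Tries the newer (get_object) and older (getObject) method names;
    plain Python objects are returned unchanged.
    """
    for method in ("get_object", "getObject"):
        resolver = getattr(obj, method, None)
        if callable(resolver):
            return resolver()
    return obj


def iter_embedded_files(node):
    """Yield (name, file_spec) pairs from an embedded-files name-tree node."""
    node = _resolve(node)
    # Leaf entries: a flat array alternating name, file spec, name, file spec...
    names = _resolve(node["/Names"]) if "/Names" in node else []
    for i in range(0, len(names) - 1, 2):
        yield str(_resolve(names[i])), _resolve(names[i + 1])
    # Intermediate nodes list their children under /Kids.
    kids = _resolve(node["/Kids"]) if "/Kids" in node else []
    for kid in kids:
        yield from iter_embedded_files(kid)


# Stand-in tree built from plain dicts and lists, for illustration only.
demo_tree = {
    "/Kids": [
        {"/Names": ["a.csv", {"id": 1}]},
        {"/Names": ["b.csv", {"id": 2}, "c.csv", {"id": 3}]},
    ]
}
found = list(iter_embedded_files(demo_tree))
```

With a real document, each yielded file specification can be handled as in Step 3, for example spec['/EF']['/F'].getData() with the older PyPDF2 method names.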

Solution 1
0 2020-07-18 17:33:25

Solution 2
0 2020-07-18 19:13:48