
How to extract text (PyPDF2) from specific location/span on PDF

Original post: 2021-11-02 20:26:32 · tags: python, pdf, text, scrape, pypdf2

I have extracted the text of a PDF page into the variable Text. I want to extract the number that follows the string 'your number is' (a 14-character string, matched at span (982, 996)):

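import re
import PyPDF2
String = 'your number is'  # the 14-character phrase to search for
reader = PyPDF2.PdfFileReader(filename)  # renamed from `object`, which shadows a builtin
PageObj = reader.getPage(0)  # the page must be fetched before its text is extracted
Text = PageObj.extractText()
ResSearch = re.search(String, Text)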

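I get a result: span = (982, 996), match = 'your number is'. All I need now is to scrape the three-digit text that follows it ('your number is105 '). The file changes every day, so the extraction has to be dynamic.
Thanks, everyone!!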

1 Answer

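The question is about regular expressions rather than about PDFs as such. Assuming there is at most one match per page, you can use search; otherwise use findall. Check the documentation on how to use group and the (...) capturing part of a pattern.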
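To make the difference between search and findall concrete, here is a minimal sketch on a made-up string. The sample text and variable names are assumptions for illustration only; in practice the text would come from extractText().

```python
import re

# Hypothetical sample text standing in for the output of extractText()
sample = "Report A: your number is 105. Report B: your number is 207."

# The parentheses form a capturing group around the three digits
pattern = re.compile(r"your number is\s*(\d{3})")

# search returns the first match only (or None if nothing matches)
first = pattern.search(sample)
first_number = first.group(1) if first else None

# findall returns the captured group of every match as a list of strings
all_numbers = pattern.findall(sample)

print(first_number)
print(all_numbers)
```

group(1) refers to the first pair of parentheses in the pattern, while group(0) would return the whole matched phrase including 'your number is'.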

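import re
import PyPDF2

filename = ''  # path to the PDF file

pdf_r = PyPDF2.PdfFileReader(filename)
text = pdf_r.getPage(0).extractText()  # from 1st page or make a loop

# re.search scans the whole text; re.match would only try the very beginning.
# \s* tolerates a missing or extra space between the phrase and the digits.
if m := re.search(r'your number is\s*(\d{3})', text):
    my_number = int(m.group(1))  # as int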
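Since the question already has the match span, another option is to slice the text starting at the end of that span. The following is only a sketch under assumptions: the sample text is invented, and it presumes the digits directly follow the phrase apart from optional whitespace.

```python
import re

# Hypothetical page text; in the question this would be the Text variable
text = "Dear customer, your number is 105 and it expires soon."

match = re.search("your number is", text)
if match is None:
    raise ValueError("phrase not found in the page text")

# span() returns the same kind of (start, end) tuple as (982, 996) in the question
start, end = match.span()

# Everything after the phrase, with leading whitespace stripped
tail = text[end:].lstrip()

# The next three characters, checked to be digits before use
following_digits = tail[:3]
if not following_digits.isdigit():
    raise ValueError("no three-digit number after the phrase")

print((start, end))
print(following_digits)
```

A capturing group is usually shorter, but slicing from match.end() can be handy when the text after the phrase needs further inspection.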

With PyPDF4 the syntax is the same, and it doesn't "have" such problems:

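From the doc: This works well for some PDF files, but poorly for others, depending on the generator used. [...] Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.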
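Given that caveat, a slightly more defensive variant may help when the extracted text has irregular whitespace or when the number can appear on any page. This is a sketch, not part of the original answer: numbers_by_page is a hypothetical helper, and it takes a list of already extracted page strings (for example, one extractText() result per page) so that it can be tried without a PDF file.

```python
import re
from typing import Dict, List

# \s* between the words tolerates dropped spaces and stray line breaks
NUMBER_RE = re.compile(r"your\s*number\s*is\s*(\d{3})")


def numbers_by_page(page_texts: List[str]) -> Dict[int, int]:
    """Map page index to the first three-digit number found after the phrase."""
    found: Dict[int, int] = {}
    for index, page_text in enumerate(page_texts):
        match = NUMBER_RE.search(page_text)
        if match:
            found[index] = int(match.group(1))
    return found


# Hypothetical page texts imitating uneven extraction output
pages = ["your number is105 ", "nothing here", "your\nnumber  is 207"]
result = numbers_by_page(pages)
print(result)
```

Pages without a match are simply left out of the result, so the caller can tell "not found" apart from a real value.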

Related questions

PyPDF2 and PyPDF4 fails to extract text from the PDF

Extract text from pdf converted from webpage using Pypdf2

Extract text from PDF files in a directory using PyPDF2

PyPDF2 won't extract all text from PDF

PyPDF2 to extract vertical text from scanned pdf

Extract text from PDF url with io and PyPDF2 gives no output

Extract text from PDF File using Python with PyPDF2

How to extract text from Pdf using Pypdf2 excluding the text content from Charts and Tables

How to extract table value from pdf using PYPDF2?

How to use PyPDF2 to extract all the text from a .pdf file and return it as a STRING?

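Notice: the technical posts on this site are published under the CC BY-SA 4.0 license. If you reproduce them, please credit this site's URL or the original address. For any questions, contact: yoyou2525@163.com.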
