How to extract the anchor text of every hyperlink in a PDF using Python?
I am trying to extract the hyperlinks on each page of a PDF, together with their anchor text, using the PyMuPDF library. I can extract the hyperlinks with their page numbers, but I have not been able to extract the anchor text for each hyperlink.
Can anyone help me?
Here is the code:
import fitz  # PyMuPDF

result = []
with fitz.open(file) as doc:  # `file` is the path to the PDF
    for page_no in range(1, len(doc) + 1):
        page = doc[page_no - 1]  # pages are 0-indexed
        for link in page.links():
            if "uri" in link:  # only external (URI) links
                url = link["uri"]
                result.append([page_no, url])
Thanks!
You can extract the text within the link's "hot area", link["from"], like this: text = page.get_textbox(link["from"]).
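Applied to the loop from the question, this could look like the minimal sketch below. The helper name links_with_anchor_text is hypothetical, and doc is assumed to be a document already opened with fitz.open(). Note that the link rectangle may not match the visible anchor text exactly, so the result can be empty (for example, a link over an image) or include fragments of neighboring text.

```python
def links_with_anchor_text(doc):
    """Collect [page_no, url, anchor_text] for every URI link.

    `doc` is assumed to be an open PyMuPDF document, e.g. fitz.open(path).
    The function name is illustrative, not part of PyMuPDF.
    """
    result = []
    for page_no, page in enumerate(doc, start=1):
        for link in page.links():
            if "uri" not in link:
                continue  # skip links without a URI (e.g. internal jumps)
            # link["from"] is the clickable rectangle of the link on the page
            text = page.get_textbox(link["from"]).strip()
            result.append([page_no, link["uri"], text])
    return result
```

With PyMuPDF installed, it would be called as: with fitz.open(file) as doc: result = links_with_anchor_text(doc)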
Any of the other page.get_text() variants can also be used with the clip parameter if you need more text detail (e.g. color, font, ...). For example, page.get_text("dict", clip=link["from"]) delivers a dictionary of the text under the link rectangle with font name, font size, font color and more.
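As a rough sketch of how that dictionary might be walked, the hypothetical helper below (link_span_details is not a PyMuPDF function) collects the text, font, size and color of every span under a link. It assumes the usual blocks / lines / spans nesting of the "dict" output and that the span color is an sRGB integer.

```python
def link_span_details(page, link):
    """Return text, font, size and color for each text span under a link.

    `page` is assumed to be a PyMuPDF page and `link` one of the dicts
    yielded by page.links(). The function name is illustrative only.
    """
    data = page.get_text("dict", clip=link["from"])
    spans = []
    for block in data.get("blocks", []):
        # image blocks carry no "lines" key, so they are skipped here
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                spans.append({
                    "text": span["text"],
                    "font": span["font"],
                    "size": span["size"],
                    # convert the sRGB integer to a hex string like #0000ff
                    "color": f"#{span['color']:06x}",
                })
    return spans
```

Joining the "text" values of the returned spans gives the anchor text, with the styling available per span.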