How to extract the anchor text/words from every hyperlink in a PDF using Python?

Question

I am trying to extract the hyperlinks on each page of a PDF, together with their anchor text, using the PyMuPDF library. I can extract the hyperlinks with their page numbers, but I cannot extract the anchor text/words for each hyperlink.

Can anyone help me?

Here is the code:

import fitz  # PyMuPDF

result = []

# file is the path to the PDF
with fitz.open(file) as doc:
    for page_no in range(1, len(doc) + 1):
        page = doc[page_no - 1]
        for link in page.links():
            # only external links carry a "uri" key
            if "uri" in link:
                url = link["uri"]
                result.append([page_no, url])

Thanks!

Answer 1

You can extract the text within the link's "hot area", link["from"], like this: text = page.get_textbox(link["from"]).

Any of the other page.get_text() variants can also be used, via the clip parameter, if you need more text detail (e.g. color, font, ...). For example, page.get_text("dict", clip=link["from"]) delivers a dictionary of the text under the link rectangle with font name, font size, font color and more.

Solution 1
1 2022-10-09 21:52:07
