
How to extract the anchor text/words of every hyperlink in a PDF using Python?

I am trying to extract the hyperlinks on each page of a PDF, together with their anchor text, using the PyMuPDF library. I can extract the hyperlinks with their page numbers, but I have not been able to extract the anchor text/words for each hyperlink.

Can anyone help me?

Here is the code:

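import fitz  # PyMuPDF

file = "example.pdf"  # placeholder path; replace with your own PDF

result = []

with fitz.open(file) as doc:
    # Number pages from 1 in the output.
    for page_no, page in enumerate(doc, start=1):
        for link in page.links():
            # Only links that point to a URL carry a "uri" key.
            if "uri" in link:
                result.append([page_no, link["uri"]])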

Thanks!

You can extract the text within the link's "hot area", link["from"], like this: text = page.get_textbox(link["from"]).
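Applied to the loop from the question, this could look roughly like the sketch below. The helper name links_with_anchor_text and the (page number, URL, text) tuple layout are illustrative choices, not part of PyMuPDF; the only library calls used are page.links() and page.get_textbox().

```python
def links_with_anchor_text(doc):
    """Collect (page_no, url, anchor_text) for every URL link.

    `doc` is assumed to be an open PyMuPDF document,
    e.g. the object returned by fitz.open(path).
    """
    rows = []
    for page_no, page in enumerate(doc, start=1):
        for link in page.links():
            # Skip internal links (e.g. "go to page") that have no URL.
            if "uri" not in link:
                continue
            # link["from"] is the clickable rectangle on the page.
            text = page.get_textbox(link["from"]).strip()
            rows.append((page_no, link["uri"], text))
    return rows
```

Inside the question's `with fitz.open(file) as doc:` block, `result = links_with_anchor_text(doc)` would replace the nested loops. Because the text is taken from the clickable rectangle rather than from the link itself, it may come back empty or include neighbouring characters when that rectangle does not line up with the visible text.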

Any of the other page.get_text() variants can also be used, via the clip parameter, if you need more detail about the text (e.g. color, font, ...). For example, page.get_text("dict", clip=link["from"]) returns a dictionary of the text under the link rectangle with font name, font size, font color and more.
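As a rough sketch of reading that dictionary, the helper below walks the blocks, lines and spans of the "dict" output and collects a few fields per span. The function name link_text_spans and the choice of fields are assumptions made for illustration; `page` is assumed to be a PyMuPDF page and `link` one of the dictionaries yielded by page.links().

```python
def link_text_spans(page, link):
    """Return (text, font, size, color) for each text span under a link."""
    details = page.get_text("dict", clip=link["from"])
    spans = []
    for block in details["blocks"]:
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            for span in line["spans"]:
                spans.append(
                    (span["text"], span["font"], span["size"], span["color"])
                )
    return spans
```

Joining the first element of each tuple gives the anchor text, while the remaining fields can help tell links apart by styling. The color value is an integer encoding the sRGB color.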
