python - How to extract the text of DOCX hyperlinks?

Building on this solution:

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

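def iter_hyperlink_rels(rels):
    # Yield the target URL of every hyperlink relationship in the document part
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]._target

# iter_hyperlink_rels is a generator, so materialize it before printing
print(list(iter_hyperlink_rels(rels)))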

I need to get both the URL and the text of the hyperlink (e.g. mydomain.com for the URL and Go to My Domain for the text).

Answering my own question, I had to go via HTML to do this:

from bs4 import BeautifulSoup
with open('my_word_file.htm', 'r') as file:
    page = file.read()
soup = BeautifulSoup(page, 'lxml')

text_and_url = []
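# Collect the visible text and the href of every anchor tag
for link in soup.find_all('a'):
    # get_text() also works when the link text is split across nested tags,
    # where link.string would return None
    text_and_url.append({'text': link.get_text(), 'url': link.get('href')})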

For converting the docx file to HTML, see:

how to convert .docx file to html using python?
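
If the HTML round trip is inconvenient, the same text and URL pairs can probably be read straight from the .docx package, since a .docx file is a zip archive of XML parts. The following is only a sketch under some assumptions: it uses the standard library rather than python-docx, the function name extract_hyperlinks is made up for illustration, it looks only at the main document body (not headers, footers or footnotes), and it covers links stored as w:hyperlink elements with an r:id, not links stored as HYPERLINK field codes.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by (transitional) WordprocessingML packages
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def extract_hyperlinks(docx_path):
    """Return [{'text': ..., 'url': ...}] for hyperlinks in the main document body.

    Hypothetical helper: docx_path may be a file path or a file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship id -> target URL, keeping hyperlink relationships only
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_REL_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        r_id = hyperlink.get(f"{{{R_NS}}}id")
        if r_id not in targets:
            continue  # e.g. internal bookmarks, which carry w:anchor instead of r:id
        # The visible text may be split across several runs, so join every w:t
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append({"text": text, "url": targets[r_id]})
    return links
```

Calling extract_hyperlinks('test.docx') should then give a list of dictionaries in the same shape as text_and_url above, with the URL taken from the relationship target and the text taken from the runs inside the link.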
