python - How to extract the text of a DOCX hyperlink?

Question

Building on this solution:

from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

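def iter_hyperlink_rels(rels):
    # rels maps relationship ids (e.g. 'rId5') to relationship objects
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]._target  # the URL for an external hyperlink

# materialize the generator so the URLs are printed, not the generator object
print(list(iter_hyperlink_rels(rels)))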

I need to get both the URL and the text of the hyperlink (e.g. mydomain.com for the URL and Go to My Domain for the text).

Answer 1

Answering my own question, I had to go via HTML to do this:

from bs4 import BeautifulSoup
with open('my_word_file.htm', 'r') as file:
    page = file.read()
soup = BeautifulSoup(page, 'lxml')

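text_and_url = []
for link in soup.find_all('a'):
    # get_text() also handles links whose text is split across nested tags,
    # where link.string would be None
    text_and_url.append({'text': link.get_text(), 'url': link.get('href')})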

For converting the docx file to HTML, see:

how to convert .docx file to html using python?

Solution 1
0 2019-07-25 14:19:30