python - How to extract the text of DOCX hyperlinks?
Building on this solution:
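from docx import Document
from docx.opc.constants import RELATIONSHIP_TYPE as RT

document = Document('test.docx')
rels = document.part.rels

def iter_hyperlink_rels(rels):
    # Yield the target (URL) of every hyperlink relationship in the document part
    for rel in rels:
        if rels[rel].reltype == RT.HYPERLINK:
            yield rels[rel]._target

# Consume the generator so the URLs are printed rather than the generator object
print(list(iter_hyperlink_rels(rels)))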
I need to get both the URL and the text of the hyperlink (e.g. mydomain.com for the URL and Go to My Domain for the text).
Answering my own question, I had to go via html to do this:
from bs4 import BeautifulSoup

with open('my_word_file.htm', 'r') as file:
    page = file.read()

soup = BeautifulSoup(page, 'lxml')

# Collect the visible text and the href of every <a> element
text_and_url = []
for link in soup.find_all('a'):
    text_and_url.append({'text': link.string, 'url': link.get('href')})
For conversion of the docx file to html:
How to convert .docx file to html using python?
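If converting to html is not convenient, the link text and URL can probably also be read straight from the .docx package, since a .docx file is a zip archive of XML parts. The following is only a minimal sketch using the standard library, not part of the original answer. The function name extract_hyperlinks is hypothetical, and it assumes the main part is stored at word/document.xml with its relationships in word/_rels/document.xml.rels (the usual layout for files saved by Word). It only handles links stored as w:hyperlink elements; links inserted as HYPERLINK field codes, or links in headers and footers, are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard OOXML namespaces
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_RELS_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def extract_hyperlinks(docx_file):
    """Return a list of {'text': ..., 'url': ...} dicts for the main document body.

    Hypothetical helper: assumes the conventional part names inside the archive.
    """
    with zipfile.ZipFile(docx_file) as archive:
        rels_root = ET.fromstring(archive.read("word/_rels/document.xml.rels"))
        doc_root = ET.fromstring(archive.read("word/document.xml"))

    # Map relationship ids (e.g. "rId5") to their targets (the URLs)
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels_root.iter(f"{{{PKG_RELS_NS}}}Relationship")
        if rel.get("Type") == HYPERLINK_TYPE
    }

    links = []
    for hyperlink in doc_root.iter(f"{{{W_NS}}}hyperlink"):
        rel_id = hyperlink.get(f"{{{R_NS}}}id")
        if rel_id not in targets:
            continue  # e.g. internal bookmark links, which carry no r:id
        # The visible text may be split across several runs, so join all w:t nodes
        text = "".join(t.text or "" for t in hyperlink.iter(f"{{{W_NS}}}t"))
        links.append({"text": text, "url": targets[rel_id]})
    return links
```

Calling extract_hyperlinks('test.docx') should then give a list in the same shape as text_and_url above, one dict per hyperlink.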