Python - using docx to extract text with certain features from Word documents
I have a question about using Python to identify text with certain features in a Word document.
I wish to extract text that is bold and has quotation marks around it, for example:
This is a "sentence" in word document.
How can I identify the word "sentence" in Python?
This is what I have at the moment:
from docx import Document

document = Document(filepath)
short_list = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # run.bold is True only when bold is set directly on the run
        if run.bold:
            short_list.append(run.text)
Thank you all for your help!
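For reference, the loop above collects every bold run but never checks for the quotation marks. A minimal sketch of that extra check follows, assuming the quotes either sit inside the bold run or at the edges of the runs on either side of it (how Word splits text into runs varies, so this will not catch every layout). The helper name `quoted_bold_texts` is hypothetical, and the sample runs are stand-ins built with `SimpleNamespace` rather than real python-docx objects.

```python
from types import SimpleNamespace

# Straight and curly double quotes. A set is used so that an empty
# string (no neighbouring text) is never treated as a quote.
QUOTES = {'"', '\u201c', '\u201d'}


def quoted_bold_texts(runs):
    """Return the text of bold runs that are enclosed in double quotes.

    `runs` is a sequence of objects with `.text` and `.bold`
    attributes, for example `paragraph.runs` from python-docx.
    """
    runs = list(runs)
    found = []
    for i, run in enumerate(runs):
        if not run.bold or not run.text:
            continue
        text = run.text
        # Case 1: the quotes are bold as well and live inside the run.
        if len(text) >= 2 and text[0] in QUOTES and text[-1] in QUOTES:
            found.append(text[1:-1])
            continue
        # Case 2: the quotes are the last/first characters of the
        # neighbouring runs.
        before = runs[i - 1].text if i > 0 else ''
        after = runs[i + 1].text if i + 1 < len(runs) else ''
        if before[-1:] in QUOTES and after[:1] in QUOTES:
            found.append(text)
    return found


# Stand-in runs shaped like the example sentence in the question.
sample_runs = [
    SimpleNamespace(text='This is a "', bold=None),
    SimpleNamespace(text='sentence', bold=True),
    SimpleNamespace(text='" in word document.', bold=None),
]
print(quoted_bold_texts(sample_runs))  # prints ['sentence']
```

With a real document this would be called once per paragraph, e.g. `quoted_bold_texts(paragraph.runs)`. Bold inherited from a paragraph or character style shows up as `run.bold is None` and is skipped by this sketch.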
A slightly tricky solution: first convert your docx file to HTML using mammoth, and then parse it with a regex:
import re
import mammoth

with open('file.docx', 'rb') as f:
    html = mammoth.convert_to_html(f).value

# Capture bold text that sits directly between double quotes
result = re.findall(r'"<strong>(.*?)</strong>"', html)
I tested this on a sample docx file with text in the body and in a footnote.
Here is my output:
['sentence', 'one more sentence', 'final sentence']
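The regex step can be tried on its own without a docx file. The sketch below runs a slightly broader pattern over a hand-written HTML string. That string is a hypothetical sample, not real mammoth output, and accepting curly quotes and the `&quot;` entity is my assumption about what a converter might emit, since Word often replaces straight quotes with curly ones.

```python
import re

# Hypothetical HTML, shaped like what the regex in this answer expects.
html = (
    '<p>This is a "<strong>sentence</strong>" in word document.</p>'
    '<p>Here is \u201c<strong>one more sentence</strong>\u201d.</p>'
    '<p><strong>Bold but unquoted</strong> text.</p>'
)

# A quote can be a straight quote, the &quot; entity, or a curly quote.
QUOTE = r'(?:"|&quot;|[\u201c\u201d])'

# Quote, bold span, quote; the group captures only the bold text.
pattern = re.compile(QUOTE + r'<strong>(.*?)</strong>' + QUOTE)

result = pattern.findall(html)
print(result)  # prints ['sentence', 'one more sentence']
```

The unquoted bold text in the third paragraph is skipped because the pattern requires a quote immediately before `<strong>` and immediately after `</strong>`.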
I would assume you cannot.
A docx file is in fact a zip file, and according to the documentation of the Python docx module, the Document object represents the document.xml part of the file. Unfortunately, footnotes are stored in a different part: footnotes.xml.
As the module declares its development status as 3 - Alpha on PyPI, I suppose that it cannot currently process footnotes.
IMHO, you should first check whether there is already an open issue about this and, if so, comment on it; otherwise, file a new issue on the project page.
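Since a docx file is a zip archive, the footnotes part can also be read directly with the standard library. This is only a rough sketch: the helper name `bold_footnote_texts` is hypothetical, it assumes the part lives at `word/footnotes.xml`, and it only sees bold that is set directly on a run (a `<w:b/>` element), not bold inherited from a style. The in-memory archive at the end is a stripped-down stand-in for a real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by document.xml and footnotes.xml.
W = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
NS = {'w': W}


def bold_footnote_texts(docx_file):
    """Return the text of directly bold runs in word/footnotes.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        if 'word/footnotes.xml' not in archive.namelist():
            return []  # the document has no footnotes part
        root = ET.fromstring(archive.read('word/footnotes.xml'))

    texts = []
    for run in root.iter(f'{{{W}}}r'):
        bold = run.find('w:rPr/w:b', NS)
        # <w:b/> turns bold on; w:val="0" or "false" turns it off.
        if bold is None or bold.get(f'{{{W}}}val') in ('0', 'false'):
            continue
        text = ''.join(t.text or '' for t in run.findall('w:t', NS))
        if text:
            texts.append(text)
    return texts


# Minimal stand-in archive containing only a footnotes part.
footnotes_xml = (
    f'<w:footnotes xmlns:w="{W}"><w:footnote w:id="1"><w:p>'
    '<w:r><w:t>See the "</w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>final sentence</w:t></w:r>'
    '<w:r><w:t>" for details.</w:t></w:r>'
    '</w:p></w:footnote></w:footnotes>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('word/footnotes.xml', footnotes_xml)
buffer.seek(0)

print(bold_footnote_texts(buffer))  # prints ['final sentence']
```

For a file on disk, pass the path instead of the buffer, e.g. `bold_footnote_texts('file.docx')`.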
Try the example code below:
for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        paragraph.text = 'new text containing ocean'
To search in tables as well, you would need to use something like:
for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    ...
See How to use python-docx to replace text in a Word document and save