Python: using docx to extract text with specific features from a Word document

Question

I have a question about using Python to identify text with certain features in a Word document.

I want to extract text that is bold and surrounded by quotation marks, for example:

This is a "sentence" in a Word document. (Here the word sentence is bold.)

How can I identify the word "sentence" in Python?

This is what I have at the moment:

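from docx import Document

# filepath is the path to the .docx file
document = Document(filepath)
short_list = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # run.bold is True only when bold is applied directly to the run
        if run.bold:
            short_list.append(run.text)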

Thank you all for your help!

Answer 1

A slightly tricky solution: first convert the docx file to HTML using mammoth, then parse the result with a regular expression:

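import re
import mammoth

with open('file.docx', 'rb') as f:
    # Bold text comes out wrapped in <strong> tags
    html = mammoth.convert_to_html(f).value

# The pattern expects the surrounding straight quotes to appear as &quot;
result = re.findall(r'&quot;<strong>(.*?)</strong>&quot;', html)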

I created a sample docx file with text in the body and in a footnote.

Here is my output:

['sentence', 'one more sentence', 'final sentence']
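
If the document uses curly quotes instead of straight ones, the pattern above would miss them. Here is a minimal sketch of a more tolerant pattern; the function name quoted_bold_from_html and the sample markup are made up for illustration, and the sample is hand-written to resemble converter output rather than taken from a real mammoth run.

```python
import html
import re

# <strong>...</strong> wrapped in straight (&quot; or ") or curly quotes
QUOTED_BOLD = re.compile(
    r'(?:&quot;|[\u201c\u201d"])\s*<strong>(.*?)</strong>\s*(?:&quot;|[\u201c\u201d"])',
    re.DOTALL,
)


def quoted_bold_from_html(markup):
    """Return the unescaped text of every quoted <strong> element."""
    return [html.unescape(m).strip() for m in QUOTED_BOLD.findall(markup)]


# Hand-written sample standing in for converted HTML (an assumption).
sample = (
    "<p>This is a &quot;<strong>sentence</strong>&quot; in a Word document.</p>"
    "<p>Here is \u201c<strong>one more sentence</strong>\u201d.</p>"
    "<p>This <strong>bold text</strong> has no quotes.</p>"
)
print(quoted_bold_from_html(sample))  # ['sentence', 'one more sentence']
```

Bold text without surrounding quotes, like the third paragraph of the sample, is skipped.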

Answer 2

I would assume you cannot.

A docx file is in fact a zip file, and according to the documentation of the Python docx module, the Document object represents the document.xml part of the file. Unfortunately, footnotes are stored in a different part: footnotes.xml.

As the module declares its development status on PyPI as 3 - Alpha, I suppose that it cannot currently process footnotes.

IMHO, you should first check whether there is already an open issue about this. If there is, comment on it; otherwise, file a new issue on the project page.
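
Since the footnotes part is ordinary XML inside the zip archive, it can also be read with the standard library alone. The following is only a sketch under some assumptions: the helper names is_bold and bold_runs are made up for illustration, the part is assumed to live at word/footnotes.xml, and only bold applied directly to a run (a w:b element in its properties) is detected, not bold inherited from a style.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the parts under word/
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def is_bold(run):
    """True when the run carries a <w:b> element that is not switched off."""
    b = run.find("w:rPr/w:b", NS)
    if b is None:
        return False
    return b.get(f"{{{W}}}val", "true") not in ("0", "false")


def bold_runs(docx_file, part="word/footnotes.xml"):
    """Return the text of every directly bold run in one part of a .docx."""
    with zipfile.ZipFile(docx_file) as archive:
        if part not in archive.namelist():
            return []  # e.g. a document without footnotes
        root = ET.fromstring(archive.read(part))
    texts = []
    for run in root.iter(f"{{{W}}}r"):
        if is_bold(run):
            texts.append("".join(t.text or "" for t in run.findall("w:t", NS)))
    return texts


# Demo on a tiny hand-built archive (not a complete, valid .docx file).
footnotes_xml = (
    f'<w:footnotes xmlns:w="{W}"><w:footnote w:id="1"><w:p>'
    '<w:r><w:t>plain </w:t></w:r>'
    '<w:r><w:rPr><w:b/></w:rPr><w:t>final sentence</w:t></w:r>'
    '<w:r><w:rPr><w:b w:val="0"/></w:rPr><w:t> not bold</w:t></w:r>'
    '</w:p></w:footnote></w:footnotes>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/footnotes.xml", footnotes_xml)
buffer.seek(0)
print(bold_runs(buffer))
```

The same function can be pointed at word/document.xml by passing part="word/document.xml", or given a file path instead of an in-memory buffer.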

Answer 3

Try the example code below:

for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        # Assigning to paragraph.text replaces all runs in the paragraph
        paragraph.text = 'new text containing ocean'

To search in tables as well, you would need something like:

for table in document.tables:
    # Cells are reached through the rows of a table
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    ...

See also: How to use python-docx to replace text in a Word document and save
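
The loops above work on whole paragraphs, while the original question needs a run-level check. Below is a rough sketch of such a check, assuming each run exposes .text and .bold as python-docx runs do. The function name quoted_bold_texts is hypothetical, and SimpleNamespace objects stand in for real runs so the example runs without a document.

```python
from types import SimpleNamespace

QUOTES = '"\u201c\u201d'  # straight and curly double quotes


def quoted_bold_texts(runs):
    """Return bold text that is wrapped in quotation marks.

    `runs` is any sequence of objects with `.text` and `.bold`,
    for example `paragraph.runs` from python-docx.
    """
    runs = list(runs)
    found = []
    i = 0
    while i < len(runs):
        if not runs[i].bold:  # None (inherited) counts as not bold here
            i += 1
            continue
        # Merge consecutive bold runs; Word may split one phrase into several.
        j = i
        while j < len(runs) and runs[j].bold:
            j += 1
        bold_text = "".join(r.text for r in runs[i:j]).strip()
        before = "".join(r.text for r in runs[:i]).rstrip()
        after = "".join(r.text for r in runs[j:]).lstrip()
        if len(bold_text) >= 2 and bold_text[0] in QUOTES and bold_text[-1] in QUOTES:
            # The quotation marks are bold themselves.
            found.append(bold_text[1:-1].strip())
        elif before and after and before[-1] in QUOTES and after[0] in QUOTES:
            # The quotation marks sit in the neighbouring non-bold text.
            found.append(bold_text)
        i = j
    return found


# Stand-in runs for: This is a "sentence" in a Word document.
demo_runs = [
    SimpleNamespace(text='This is a "', bold=None),
    SimpleNamespace(text='sentence', bold=True),
    SimpleNamespace(text='" in a Word document.', bold=None),
]
print(quoted_bold_texts(demo_runs))  # ['sentence']
```

With a real document, the function would be called once per paragraph, passing paragraph.runs for each paragraph in document.paragraphs and in the table cells.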
