
Python - using docx to extract text with certain features from Word documents

I have a question about using Python to identify text with certain features in a Word document.

I wish to extract text that is bold and has quotation marks around it, for example:

"This is a "sentence" in word document."

How can I identify the word "sentence" in Python?

This is what I have at the moment:

from docx import Document

document = Document(filepath)  # filepath: path to the .docx file
short_list = []
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # run.bold is True only when bold is set directly on the run
        if run.bold:
            short_list.append(run.text)

Thank you all for your help!

A slightly tricky solution: first convert your docx file to HTML using mammoth, then parse it with a regex:

import re
import mammoth

with open('file.docx', 'rb') as f:
    html = mammoth.convert_to_html(f).value

# Capture bold text that is directly wrapped in HTML-escaped double quotes
result = re.findall(r'&quot;<strong>(.*?)</strong>&quot;', html)

I've created a sample docx file with text in the body and in a footnote:

[image: sample document]

Here is my output:

['sentence', 'one more sentence', 'final sentence']
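If you would rather stay with python-docx instead of converting to HTML, the check for surrounding quotes can be sketched directly on the runs. This is only a rough sketch under some assumptions: the quotation marks sit in the non-bold text right before and after the bold text, and bold is set directly on the run (run.bold is None when bold is inherited from a style, which is treated as not bold here). The names quoted_bold_texts and quoted_bold_in_docx are made up for this example.

```python
from itertools import groupby

# Straight and curly double quotes (Word often autocorrects to the curly ones)
QUOTES = ('"', '\u201c', '\u201d')


def quoted_bold_texts(runs):
    """Return bold texts that are wrapped in quotes.

    runs: iterable of (text, bold) pairs for ONE paragraph, in document order.
    """
    # Word may split a visually uniform stretch of text into several runs,
    # so merge neighbouring runs that share the same bold state first.
    segments = [
        ("".join(text for text, _ in group), bold)
        for bold, group in groupby(runs, key=lambda pair: bool(pair[1]))
    ]
    found = []
    for i, (text, bold) in enumerate(segments):
        if not bold:
            continue
        before = segments[i - 1][0].rstrip() if i > 0 else ""
        after = segments[i + 1][0].lstrip() if i + 1 < len(segments) else ""
        # Keep the bold text only if a quote directly precedes and follows it
        if before.endswith(QUOTES) and after.startswith(QUOTES):
            found.append(text.strip())
    return found


def quoted_bold_in_docx(path):
    """Apply quoted_bold_texts to every body paragraph of a .docx file."""
    from docx import Document  # python-docx, imported only when needed

    results = []
    for paragraph in Document(path).paragraphs:
        pairs = [(run.text, run.bold) for run in paragraph.runs]
        results.extend(quoted_bold_texts(pairs))
    return results
```

Calling quoted_bold_in_docx('file.docx') should then give a list of the quoted bold strings from the body paragraphs. Quotes that are themselves bold (inside the bold run) are not matched by this sketch, and footnotes are not covered.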

I would assume you cannot.

A docx file is in fact a zip file, and according to the documentation of the Python docx module, the Document object represents the document.xml part of the file. Unfortunately, footnotes are stored in a different part: footnotes.xml.

Since the module declares its development status on PyPI as 3 - Alpha, I suppose that it cannot currently process footnotes.

IMHO, you should first check whether there is already an open issue about this; if so, comment on it, otherwise file a new issue on the project page.
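In the meantime, since the file is just a zip archive, the footnotes part can be read with the standard library alone. The following is a minimal sketch, not a full solution: it assumes the footnotes live at word/footnotes.xml (the usual location in files saved by Word) and that bold is set directly on the run with a w:b element, so bold inherited from a style is not detected. The function name bold_footnote_texts is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def bold_footnote_texts(docx_file, part="word/footnotes.xml"):
    """Return the text of every directly bold run in the footnotes part."""
    with zipfile.ZipFile(docx_file) as archive:
        if part not in archive.namelist():
            return []  # this document has no footnotes part
        root = ET.fromstring(archive.read(part))

    texts = []
    for run in root.iterfind(".//w:r", NS):
        bold = run.find("w:rPr/w:b", NS)
        if bold is None:
            continue
        # <w:b/> turns bold on unless its w:val attribute switches it off
        if bold.get(f"{{{W}}}val", "true") in ("0", "false", "off"):
            continue
        text = "".join(t.text or "" for t in run.iterfind("w:t", NS))
        if text:
            texts.append(text)
    return texts
```

bold_footnote_texts('file.docx') returns one string per bold run, so a phrase that Word has split across several runs comes back in pieces and may need to be joined.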

Try the example code below:

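for paragraph in document.paragraphs:
    if 'sea' in paragraph.text:
        print(paragraph.text)
        # Assigning to paragraph.text replaces the paragraph's runs,
        # so run-level formatting such as bold is lost.
        paragraph.text = 'new text containing ocean'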

To search in tables as well, you would need to use something like:

for table in document.tables:
    for row in table.rows:  # iterate rows, then the cells of each row
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if 'sea' in paragraph.text:
                    ...

See "How to use python-docx to replace text in a Word document and save".
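The two loops above can be folded into one generator so that body text and table text are searched the same way. This is a small sketch that relies only on the .paragraphs and .tables attributes that python-docx documents and table cells both provide; iter_paragraphs and find_paragraphs are names invented for this example.

```python
def iter_paragraphs(container):
    """Yield the container's own paragraphs, then those inside its tables.

    container: a python-docx Document or table cell (anything that has
    .paragraphs and .tables).
    """
    yield from container.paragraphs
    for table in container.tables:
        for row in table.rows:
            for cell in row.cells:
                # A cell can hold nested tables, so recurse into it
                yield from iter_paragraphs(cell)


def find_paragraphs(container, needle):
    """Return every paragraph whose text contains needle."""
    return [p for p in iter_paragraphs(container) if needle in p.text]
```

With python-docx installed, find_paragraphs(Document('file.docx'), 'sea') would collect the matching paragraphs from the body and from all tables. Merged cells may be visited more than once, so duplicates are possible.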
