
How to retrieve a particular section from a docx file using the Python docx2txt module

Original post: 2020-04-02 05:19:30 6 1 python

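I am using the Python docx2txt module to process docx files. I can already get the text of the whole document with this module, but my requirement is to retrieve the document section by section (for example, a single heading together with its content). Please help me retrieve a specific section or heading using the docx2txt library.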

1 answer

The docx2txt module has no built-in function for that.

That means you have a few options.

1. Try to recognize section headers and sections in the converted text. This will probably be difficult, because a one-sentence paragraph is hard to distinguish from a section header once the formatting is gone.
2. Open the docx file using the zipfile module, then read word/document.xml from the archive and extract the information yourself. This gives you the complete XML structure, including paragraph styles, so it should be possible to recognize section headers.
3. Use python-docx, as in this question.
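The zipfile approach can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the function name `split_by_heading` is made up for illustration, and it assumes that headings are marked with paragraph styles whose id starts with "Heading" (such as "Heading1"), which is common for documents based on English Word templates but is not guaranteed.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_by_heading(docx_file):
    """Return a list of (heading_text, [paragraph_texts]) pairs.

    ASSUMPTION: a paragraph is a heading when its style id starts
    with "Heading". Localized or custom templates may use other ids.
    """
    # A .docx file is a zip archive; the body lives in word/document.xml
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    sections = []
    current = None  # text before the first heading is skipped
    for para in root.iter(W + "p"):
        # A paragraph's text is spread over one or more <w:t> elements
        text = "".join(t.text or "" for t in para.iter(W + "t"))
        style = para.find(W + "pPr/" + W + "pStyle")
        style_id = style.get(W + "val", "") if style is not None else ""
        if style_id.startswith("Heading"):
            current = (text, [])
            sections.append(current)
        elif current is not None and text.strip():
            current[1].append(text)
    return sections
```

With a hypothetical file name, `split_by_heading("report.docx")` would return the headings in document order, each paired with the non-empty paragraphs that follow it up to the next heading.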

All automated processing requires that the documents have a consistent internal structure. If a document uses real section headers in some places but manually formatted lines to start a section in others, the conversion is bound to fail.
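To illustrate how fragile text-only recognition is, here is a rough sketch of a heading heuristic applied to already converted text (for example the string returned by `docx2txt.process`). The function names and the rules (short line, no closing punctuation, numbered or title-cased) are assumptions chosen for illustration, not part of docx2txt.

```python
import re


def looks_like_heading(line, max_words=8):
    """Guess whether a plain-text line is a section heading.

    ASSUMPTION: headings are short, do not end in punctuation, and are
    either numbered ("2.1 Scope") or title-cased ("Project Overview").
    """
    line = line.strip()
    if not line or len(line.split()) > max_words:
        return False
    if line[-1] in ".!?:;,":
        return False
    numbered = re.match(r"\d+(\.\d+)*\.?\s+\S", line) is not None
    return numbered or line.istitle()


def split_text_by_heading(text):
    """Group lines under the most recent line that looks like a heading."""
    sections = []
    current = None  # lines before the first detected heading are skipped
    for line in text.splitlines():
        if looks_like_heading(line):
            current = (line.strip(), [])
            sections.append(current)
        elif current is not None and line.strip():
            current[1].append(line.strip())
    return sections
```

A short title-cased line such as "Thanks Everyone" is classified as a heading by this heuristic even though it is ordinary body text, which is exactly the kind of misclassification that inconsistent documents provoke.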

Related questions

Extracting Images from Word Documents Using Python docx2txt

How do I get docx2txt to process all docx files in a directory?

How to convert .docx to .txt in Python

Extract image position from .docx file using python-docx

How to extract text from an existing docx file using python-docx

How to write multiple tables in a docx file using python-docx?

How to replace multiple words in a .docx file and save the docx file using python-docx

How to identify page breaks using python-docx from docx

How to delete a line from a docx document using python-docx

How to retrieve particular table data in multiple tables using python-docx?

