Extracting Highlighted Words from Word Document (.docx) in Python
I am working with a set of Word documents in which some words are highlighted in different colors (for example yellow, blue, and gray). I want to extract the highlighted words associated with each color. I am programming in Python.
Here is what I have done so far: I opened the Word document with python-docx and then selected the <w:r> tags, which contain the tokens (words) in the document.
I used the following code:
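#!/usr/bin/env python2.6
# -*- coding: ascii -*-
from docx import opendocx  # opendocx belongs to the legacy python-docx API

document = opendocx('test.docx')
# Select every run (<w:r>) element in the document
words = document.xpath('//w:r', namespaces=document.nsmap)
for word in words:
    print(word)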
Now I am stuck at the part where I check whether each word has a <w:highlight> tag, extract the color code from it, and, if it matches yellow, print the text inside the <w:t> tag. I would really appreciate it if someone could point me towards extracting the words from the parsed file.
I had never worked with python-docx before, but it helped that I found a snippet online showing what the XML structure of a highlighted piece of text looks like:
<w:r>
    <w:rPr>
        <w:highlight w:val="yellow"/>
    </w:rPr>
    <w:t>text that is highlighted</w:t>
</w:r>
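To make the lookup concrete, here is a minimal sketch that parses the fragment above with the standard library's xml.etree.ElementTree instead of python-docx. The xmlns:w declaration is added only so the fragment parses on its own (in a real document it sits on the root element), and the names W_NS, FRAGMENT, and highlight_of are hypothetical, chosen for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace in Clark notation ("{uri}")
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# The run from the snippet, with the namespace declared so it parses standalone
FRAGMENT = (
    '<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:rPr><w:highlight w:val="yellow"/></w:rPr>'
    '<w:t>text that is highlighted</w:t>'
    '</w:r>'
)


def highlight_of(run):
    """Return (color, text) for a <w:r> element; color is None if not highlighted."""
    highlight = run.find(W_NS + "rPr/" + W_NS + "highlight")
    # The w:val attribute is namespace-qualified too, so it needs the same prefix
    color = highlight.get(W_NS + "val") if highlight is not None else None
    text = run.findtext(W_NS + "t", default="")
    return color, text


run = ET.fromstring(FRAGMENT)
print(highlight_of(run))  # ('yellow', 'text that is highlighted')
```

Both the tag names and the val attribute carry the namespace, which is why a plain lookup of 'val' or 'highlight' finds nothing.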
From there, it was relatively straightforward to come up with this:
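from docx import opendocx  # opendocx belongs to the legacy python-docx API

document = opendocx(r'test.docx')
words = document.xpath('//w:r', namespaces=document.nsmap)

# Namespace-qualified tag and attribute names
WPML_URI = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
tag_rPr = WPML_URI + 'rPr'
tag_highlight = WPML_URI + 'highlight'
tag_val = WPML_URI + 'val'
tag_t = WPML_URI + 't'

for word in words:
    for rPr in word.findall(tag_rPr):
        highlight = rPr.find(tag_highlight)
        # Skip runs whose properties carry no <w:highlight> element
        if highlight is not None and highlight.attrib.get(tag_val) == 'yellow':
            text = word.find(tag_t)
            if text is not None:
                print(text.text)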
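Since the question asks for the words associated with each color rather than only yellow, the same idea can be extended to group text by highlight color. The following is a rough sketch under a few assumptions, not a definitive implementation: it uses only the standard library (zipfile and xml.etree.ElementTree) instead of python-docx, it reads only the main body in word/document.xml, and the function name highlighted_text_by_color is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import defaultdict

# WordprocessingML main namespace in Clark notation ("{uri}")
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def highlighted_text_by_color(docx_path):
    """Map each highlight color name to the list of run texts carrying it."""
    # A .docx file is a zip archive; the document body is word/document.xml
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    by_color = defaultdict(list)
    for run in root.iter(W_NS + "r"):
        highlight = run.find(W_NS + "rPr/" + W_NS + "highlight")
        if highlight is None:
            continue  # this run is not highlighted
        color = highlight.get(W_NS + "val")
        # A run may hold more than one <w:t>; join them into one string
        text = "".join(t.text or "" for t in run.iter(W_NS + "t"))
        # "none" is a valid w:val that explicitly switches highlighting off
        if color and color != "none" and text:
            by_color[color].append(text)
    return dict(by_color)
```

Calling highlighted_text_by_color('test.docx') returns a dictionary keyed by color name, for example 'yellow' or 'blue'. Word often splits one visible word across several runs, so adjacent entries of the same color may need to be joined afterwards.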