
Extracting data from the XML of a .docx document

I need to extract the data between tags as shown below. I also want to concatenate the data when it belongs to the same id.

For example, in the XML below both <w:t> tags sit inside <w:r> elements with the same id, "00F1234A", so "Hello World" needs to be extracted.

xml_string = '''
<w:r w:rsid="00F1234A">
    <w:rPr>

    </w:rPr>
    <w:t>Hello</w:t>
</w:r>


<w:r w:rsid="00F1234A">
    <w:rPr>

    </w:rPr>
    <w:t xml:space="preserve">World</w:t>
</w:r>'''

Currently, I am extracting the data between tags with the following regex:

re.findall("<w:t>(.+?)</w:t>",xml_string)

This gives me Hello, but not Hello World.

How can I concatenate the data that corresponds to the same id, which in this case is "00F1234A"?

To parse this, you'll need the namespace declarations from the XML (attributes of the form xmlns:x="urn:something").
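The fragment in the question uses the `w:` prefix without declaring it, so ElementTree cannot parse it as given. Here is a minimal sketch of one workaround, assuming the fragment is wrapped in a hypothetical `<root>` element that declares the WordprocessingML namespace. The names `parse_fragment` and `root` are illustrative only. In a real .docx the namespace is normally already declared in `word/document.xml`, so no wrapping should be needed there.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound to the "w" prefix
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The two runs from the question, without any namespace declaration
fragment = '''
<w:r w:rsid="00F1234A"><w:rPr></w:rPr><w:t>Hello</w:t></w:r>
<w:r w:rsid="00F1234A"><w:rPr></w:rPr><w:t xml:space="preserve">World</w:t></w:r>
'''

def parse_fragment(fragment):
    """Wrap the fragment in a hypothetical <root> that declares xmlns:w, then parse it."""
    wrapped = f'<root xmlns:w="{W_NS}">{fragment}</root>'
    return ET.fromstring(wrapped)

root = parse_fragment(fragment)

# With the prefix mapped, the runs can be found by their qualified name
runs = root.findall("w:r", {"w": W_NS})
```

Parsing the bare fragment with `ET.fromstring(fragment)` raises `xml.etree.ElementTree.ParseError`, because the `w` prefix is not bound to any namespace.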

Use ElementTree to extract the values instead of a regex, like so:

import xml.etree.ElementTree as ET

# parse the XML string; it must be a complete document whose
# enclosing element declares the w: namespace
root = ET.fromstring(xml_string)

# declare namespace dictionary
nsmap = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

tagvalues = []
# loop through all w:t tags and append their values to the list
for i in root.findall('.//w:r//w:t', nsmap):
    tagvalues.append(i.text or '')

# concatenate all values into one string; a space separator yields
# "Hello World" for the sample (use '' if the runs already carry spaces)
string = ' '.join(tagvalues)
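The loop above collects every `w:t` value regardless of which run it came from. The question asks to concatenate only text that shares the same `w:rsid`, so here is one possible sketch that groups by that attribute. It makes three assumptions: the sample is wrapped in a hypothetical `<root>` element that declares the namespace, the id is stored in `w:rsid` exactly as in the question (real Word files often use attributes such as `w:rsidR` instead), and the pieces are joined with a single space. The helper name `text_by_rsid` is made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NSMAP = {"w": W_NS}

# Sample data from the question, wrapped in a hypothetical <root>
# element so that the w: prefix is declared
sample = f'''<root xmlns:w="{W_NS}">
<w:r w:rsid="00F1234A"><w:rPr></w:rPr><w:t>Hello</w:t></w:r>
<w:r w:rsid="00F1234A"><w:rPr></w:rPr><w:t xml:space="preserve">World</w:t></w:r>
</root>'''

def text_by_rsid(xml_text, sep=" "):
    """Return {rsid: concatenated text} for all w:r runs in xml_text."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    # Namespaced names are spelled "{namespace}local" in ElementTree
    for run in root.iter(f"{{{W_NS}}}r"):
        rsid = run.get(f"{{{W_NS}}}rsid")
        for t in run.findall("w:t", NSMAP):
            groups[rsid].append(t.text or "")
    return {rsid: sep.join(parts) for rsid, parts in groups.items()}

result = text_by_rsid(sample)
print(result)  # {'00F1234A': 'Hello World'}
```

Unlike the regex `<w:t>(.+?)</w:t>`, this also picks up `<w:t xml:space="preserve">`, because the element is matched by its name rather than by its exact opening-tag text.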

Check out this post as well.
