lxml find all elements between two tags

I extracted a Word document and am searching it for all bookmarks. But the bookmark tag has no end tag (it is an empty, self-closing element), so lxml finds only the bookmarkStart element and not the elements between bookmarkStart and bookmarkEnd. How can I get all elements between bookmarkStart and bookmarkEnd? Thanks!

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:cx="http://schemas.microsoft.com/office/drawing/2014/chartex" xmlns:cx1="http://schemas.microsoft.com/office/drawing/2015/9/8/chartex" xmlns:cx2="http://schemas.microsoft.com/office/drawing/2015/10/21/chartex" xmlns:cx3="http://schemas.microsoft.com/office/drawing/2016/5/9/chartex" xmlns:cx4="http://schemas.microsoft.com/office/drawing/2016/5/10/chartex" xmlns:cx5="http://schemas.microsoft.com/office/drawing/2016/5/11/chartex" xmlns:cx6="http://schemas.microsoft.com/office/drawing/2016/5/12/chartex" xmlns:cx7="http://schemas.microsoft.com/office/drawing/2016/5/13/chartex" xmlns:cx8="http://schemas.microsoft.com/office/drawing/2016/5/14/chartex" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:aink="http://schemas.microsoft.com/office/drawing/2016/ink" xmlns:am3d="http://schemas.microsoft.com/office/drawing/2017/model3d" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:oel="http://schemas.microsoft.com/office/2019/extlst" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:w16cex="http://schemas.microsoft.com/office/word/2018/wordml/cex" xmlns:w16cid="http://schemas.microsoft.com/office/word/2016/wordml/cid" xmlns:w16="http://schemas.microsoft.com/office/word/2018/wordml" xmlns:w16sdtdh="http://schemas.microsoft.com/office/word/2020/wordml/sdtdatahash" 
xmlns:w16se="http://schemas.microsoft.com/office/word/2015/wordml/symex" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 w16se w16cid w16 w16cex w16sdtdh wp14">
    <w:body>
        <w:p w14:paraId="2DDA6990" w14:textId="44789F6F" w:rsidR="0067078D" w:rsidRDefault="003F5B0A">
            <w:bookmarkStart w:id="0" w:name="testmark"/>
            <w:proofErr w:type="spellStart"/>
            <w:r>
                <w:t>sometext</w:t>
            </w:r>
            <w:bookmarkEnd w:id="0"/>
            <w:proofErr w:type="spellEnd"/>
        </w:p>
        <w:sectPr w:rsidR="0067078D">
            <w:pgSz w:w="11906" w:h="16838"/>
            <w:pgMar w:top="1417" w:right="1417" w:bottom="1134" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/>
            <w:cols w:space="708"/>
            <w:docGrid w:linePitch="360"/>
        </w:sectPr>
    </w:body>
</w:document>
from lxml import etree as ET

ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
ns2 = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

with open('document.xml', 'r', encoding='utf-8') as xml_file:
    tree_word = ET.parse(xml_file)

findall_param = 'w:bookmarkStart'
find_param = 'w:t'

root_word = tree_word.getroot()
field_content = tree_word.findall('.//'+findall_param, ns)

for bookmark in field_content:
    textmarker = bookmark.attrib[f"{ns2}name"]
    print(ET.tostring(bookmark))
    # bookmarkStart is an empty element with no descendants, so this returns None
    t = bookmark.find('.//w:t', ns)

If I understand you correctly, and based on the sample XML in the question, the following should get you at least close to what you are trying to do:

from lxml import etree

word = """[your sample xml]"""
# encode to bytes: lxml rejects str input that carries an XML encoding declaration
doc = etree.XML(word.encode())
ns = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
start_param = 'w:bookmarkStart'
t_param = 'w:t'
end_param = "bookmarkEnd"

# walk the siblings that follow bookmarkStart and stop at bookmarkEnd
for el in doc.xpath(f'//w:p[.//{start_param}]//{start_param}/following-sibling::*', namespaces=ns):
    if etree.QName(el).localname == end_param:
        break
    # replace the text of the first w:t inside this sibling, if there is one
    t_nodes = el.xpath(f'.//{t_param}', namespaces=ns)
    if t_nodes:
        t_nodes[0].text = "some new text"
print(etree.tostring(doc).decode())

Try it on your actual document and see if it works.
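
If the document holds several bookmarks, or a bookmarkEnd sits in a different paragraph than its bookmarkStart, the sibling loop above may stop too early or miss elements. As a rough sketch of one alternative, the hypothetical helper below (elements_in_bookmark is my own name, not part of lxml) walks the tree in document order and pairs bookmarkStart with the bookmarkEnd that carries the same w:id. The trimmed sample XML is an assumption based on the question's document.

```python
from lxml import etree

W_NS = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'

def elements_in_bookmark(root, name):
    """Collect, in document order, every element between the bookmarkStart
    named `name` and the bookmarkEnd that has the same w:id."""
    start_tag = f'{{{W_NS}}}bookmarkStart'
    end_tag = f'{{{W_NS}}}bookmarkEnd'
    id_attr = f'{{{W_NS}}}id'
    name_attr = f'{{{W_NS}}}name'

    collected = []
    inside = False
    bookmark_id = None
    # tag=etree.Element skips comments and processing instructions
    for el in root.iter(tag=etree.Element):
        if not inside:
            if el.tag == start_tag and el.get(name_attr) == name:
                inside = True
                bookmark_id = el.get(id_attr)
        elif el.tag == end_tag and el.get(id_attr) == bookmark_id:
            break
        else:
            collected.append(el)
    return collected

# Trimmed-down version of the question's document (assumed structure)
SAMPLE = b'''<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:p>
      <w:bookmarkStart w:id="0" w:name="testmark"/>
      <w:proofErr w:type="spellStart"/>
      <w:r><w:t>sometext</w:t></w:r>
      <w:bookmarkEnd w:id="0"/>
      <w:proofErr w:type="spellEnd"/>
    </w:p>
  </w:body>
</w:document>'''

root = etree.fromstring(SAMPLE)
found = elements_in_bookmark(root, 'testmark')
names = [etree.QName(el).localname for el in found]
text = ''.join(el.text or '' for el in found if el.tag == f'{{{W_NS}}}t')
print(names, text)
```

Note that the result includes nested elements (both w:r and the w:t inside it), because the walk follows document order rather than sibling order. Filter by tag if you only need the text runs.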
