How to remove XML tags using a regular expression in Python?

A string in Python can contain some plain text together with XML tags that carry some information. For example:

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
<info StatusCode="End">
    <user_detail>
        <name>
            <first_name>ABC</first_name>
            <last_name>XYZ</last_name>
        </name>
        <contact_details>
            <contact_number>
                <number_type>landline</number_type>
                <number>1234567</number>
            </contact_number>
            <address>
                <address_field1> lorem ipsum, qwerty </address_field1>
                <address_field2> lorem ipsum2, qwerty2 </address_field2>
                <city> asdfgh </city>
                <state> zxcvbn </state>
                <country> India </country>
            </address>
        </contact_details>
    </user_detail>
    <flight_detail>
        ...
    </flight_detail>
</info>
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
<xml_tag>
...
</xml_tag>
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum.
<xml_tag>
....
</xml_tag>

The opening XML tag is not guaranteed to be <info>; it could be <session StatusCode="End">, in which case the closing tag would be </session>. Currently I remove the XML tags using:

data = re.sub(r'<[^<]+>', "", data)
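
To illustrate what this substitution does, here is a minimal sketch on a shortened, made-up sample string (the variable names `sample` and `cleaned` are only for illustration). It strips the tags themselves but leaves the text inside the elements behind:

```python
import re

# Shortened, hypothetical sample: plain text around one small XML block.
sample = "Before.\n<name><first_name>ABC</first_name></name>\nAfter."

# Each <...> tag is removed, but the element text "ABC" is kept.
cleaned = re.sub(r'<[^<]+>', "", sample)
print(cleaned)
```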

However, I now want to remove all of the XML content from this text. The final output I want is:

The student XYZ abc has been terminated from the institute. 
you can find the details of student below:
Lorem ipsum dolor sit amet, pro ea dicat velit regione, modo putant 
sensibus pri id, ut bonorum scripserit sit. Ex nec tation alienum, est ut 
nemore efficiendi interpretaris, vis te reque eleifend. 
Laudem delectus
reprehendunt ei mei, has nisl dolorem mnesarchum no, ad eos modo singulis
euripidis. Quo no consul offendit. Eu alia utroque argumentum vix, no 
case primis eum. 

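I tried matching with </\S+>, but that only removes text up to the first closing XML tag. How can I remove all of the XML content from a string that can also contain ordinary plain text?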

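<(.*?>)(.*)</\1 with the single-line option (so that . also matches line breaks) matches the XML you want to remove. The inner XML is in the second group.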

See https://regex101.com/r/HwiA2t/1
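
As a rough sketch of the same idea in Python, the variant below captures only the tag name and matches the content lazily. The pattern above backreferences the whole opening tag, attributes included, so it cannot match a closing tag such as </info>, and its greedy middle group can swallow plain text lying between two blocks. The names `XML_BLOCK`, `strip_xml_blocks`, and `sample` are made up for illustration, and the sketch assumes well-formed blocks in which an element never contains a nested element of the same name.

```python
import re

# Assumption: blocks are well formed and an element never contains a nested
# element with the same name; self-closing tags (<tag/>) are not handled.
XML_BLOCK = re.compile(
    r"<([A-Za-z_][\w.\-]*)"   # opening tag name, captured as group 1
    r"(?:\s[^<>]*)?>"         # optional attributes, then the closing '>'
    r".*?"                    # block content, matched lazily
    r"</\1\s*>"               # closing tag with the same name
    r"\n?",                   # also drop the line break after the block
    re.DOTALL,                # let '.' match line breaks (single-line option)
)


def strip_xml_blocks(text: str) -> str:
    """Remove every top-level XML block and keep the surrounding plain text."""
    return XML_BLOCK.sub("", text)


# Shortened, hypothetical version of the text from the question.
sample = (
    "Details of the student below:\n"
    '<info StatusCode="End">\n'
    "    <user_detail>\n"
    "        <first_name>ABC</first_name>\n"
    "    </user_detail>\n"
    "</info>\n"
    "Lorem ipsum dolor sit amet.\n"
    "<xml_tag>\n"
    "...\n"
    "</xml_tag>\n"
    "Laudem delectus.\n"
)

print(strip_xml_blocks(sample))
```

If the embedded XML can nest elements of the same name or is otherwise irregular, a regular expression like this is likely to break, and locating the blocks with a real XML parser would probably be more reliable.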
