
Parse xml to extract the contents between div using Python

I have the below data which is stored in a string. It resembles xml. Is there a way that I can extract the contents of div class "page" and extract all the text under it?

I started with the code below. However, tree.text returns None.

import xml.etree.ElementTree as ET

xml = ET.fromstring(str_content)
for tree in xml:
    print(tree.text)



  <html xmlns="http://www.w3.org/1999/xhtml">
    <body><div class="page"><p />
    <p>Hi This is the content to be parsed!!! 
    Extract the text. 
    Done </p>
    <p />
    </div>
    <div class="page"><p />
    <p>Hi This is the content to be parsed!!! 
    Extract the text. 
    Done </p>
    <p />
    </div>
    </body></html>

Sample input and output for multiple <p> within div:

    <html xmlns='http://www.w3.org/1999/xhtml'>
    <body><div class='page'><p />
    <p>Text in 1st line
    </p>
    <p>Text in 2nd line
    </p>
    <p>Text in 3rd line</p>
    <p />
    </div>
    <div class='page'><p />
    <p>Text in 1st line 2nd page
    </p>
    <p>© Text in 2nd line 2nd page
    </p>
    <p>Text in 3rd line 2nd page
    </p>
    <p>Text in 4th line 2nd page.
        Still in the same para.
        I want to preserve spaces and newlines
    </p>
    <p>etc 
        etc,
        ectc
    </p>
    <p>some info | 2018-11-09 1</p>
    <p />
    </div>
    </body>
    </html>

Output for the above:

Page no.1...

Text in 1st lineText in 2nd lineText in 3rd line

Page no.2...

Text in 1st line 2nd page© Text in 2nd line 2nd pageText in 3rd line 2nd pageText in 4th line 2nd page. Still in the same para. I want to preserve spaces and newlinesetc etc, ectcsome info | 2018-11-09 1

This script uses BeautifulSoup to locate the <div>s and extract the text from them.

data is the XML string from the question.

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'html.parser')

for num, page in enumerate(soup.select('.page'), 1):
    print('Page no.{}...'.format(num))
    print('-' * 80)

    print(page.get_text(strip=True))

    print()

Prints:

Page no.1...
--------------------------------------------------------------------------------
Hi This is the content to be parsed!!!
    Extract the text.
    Done

Page no.2...
--------------------------------------------------------------------------------
Hi This is the content to be parsed!!!
    Extract the text.
    Done
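
If you would rather stay with xml.etree.ElementTree, as in the question, something along these lines might work. The loop in the question only visits the direct children of the root, which here is just <body>. Its text is None because nothing precedes its first child <div>. The tags also live in the XHTML default namespace, so searches need a namespace mapping. The sketch below is only one possible approach: the prefix "x", the helper name extract_pages and the shortened sample string are assumptions made for illustration, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Assumption: a shortened copy of the question's second sample stands in
# for str_content.
str_content = """<html xmlns='http://www.w3.org/1999/xhtml'>
<body><div class='page'><p />
<p>Text in 1st line
</p>
<p>Text in 2nd line
</p>
<p>Text in 3rd line</p>
<p />
</div>
<div class='page'><p />
<p>Text in 1st line 2nd page
</p>
<p>Text in 4th line 2nd page.
    Still in the same para.
</p>
<p />
</div>
</body>
</html>"""

# The default namespace declared on <html> applies to every descendant tag.
# The prefix "x" is arbitrary (hypothetical name chosen for this sketch).
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_pages(xml_text):
    """Return one list of paragraph strings per <div class="page">."""
    root = ET.fromstring(xml_text)
    pages = []
    for div in root.findall(".//x:div[@class='page']", NS):
        paragraphs = []
        for p in div.findall("x:p", NS):
            text = "".join(p.itertext())  # keeps inner spaces and newlines
            if text.strip():              # skip the empty <p /> elements
                paragraphs.append(text.rstrip("\n"))
        pages.append(paragraphs)
    return pages


for num, paragraphs in enumerate(extract_pages(str_content), 1):
    print("Page no.{}...".format(num))
    print("\n".join(paragraphs))
    print()
```

Each paragraph ends up on its own line, and line breaks inside a single <p> are kept, which seems to be what the second sample asks for. Only the trailing newline of each paragraph is trimmed.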
