
Parse bullet list in correct order with BeautifulSoup

I am trying to parse a website which has a structure that looks very similar to this:

<div class="InternaTesto">
<p class="MarginTop0">Paragraph 1</p><br>
<p>Paragraph 2</p><br>
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li>
    ... Some Other Items ...
</ul>
<p><strong>Paragraph 4</strong></p><br>
<ul>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 1</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 2</em></li>
    <li style="margin: 0px; text-indent: 0px;"><em>List item 3</em></li>
    ... Some Other Items ...
</ul>
... Some Other paragraphs ...
</div>

I am trying to extract the list items and put them under the correct paragraph. Right now I am able to find the list items, but they do not come out in the correct order. Here is the code I am using:

textOfTheArticle=[] 

for p in rawArticleData.find('div', attrs={'class':'InternaTesto'}).find_all("p"):
    textOfTheArticle.append(p.get_text())
    print(p.get_text() + "\n")

Is there any way to create a sublist or a separate list with all the <li> items?

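You can find all the paragraphs and, for each one, take the third next sibling. In the sample markup each <p> is followed by a <br> and a newline text node, so the third sibling is the <ul> when there is one: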

from bs4 import BeautifulSoup

data = """
Your html here
"""

soup = BeautifulSoup(data, "html.parser")
for p in soup.find('div', attrs={'class': 'InternaTesto'}).find_all("p"):
    # index 2 skips the <br> and the newline text node that follow each <p>
    print(p.text, [li.text for li in list(p.next_siblings)[2].find_all('li')])

Prints:

Paragraph 1 []
Paragraph 2 []
Paragraph 3 ['List item 1', 'List item 2', 'List item 3']
Paragraph 4 ['List item 1', 'List item 2', 'List item 3']
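
To see why index 2 happens to work for the sample markup, and why it is fragile, it can help to look at what actually follows a paragraph. The snippet below is only a sketch on a trimmed copy of the question's HTML, assuming Python 3 and the built-in html.parser; the names `html` and `kinds` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (an assumption for this sketch).
html = """<div class="InternaTesto">
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
</ul>
</div>"""

soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# Tags report their tag name; text nodes (such as the newline) report None.
kinds = [getattr(sibling, "name", None) for sibling in p.next_siblings]
print(kinds[:3])  # the <br>, the newline text node, then the <ul>
```

If the page ever drops the <br> or changes its whitespace, the <ul> moves to a different index, which is why a fixed position is risky.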

A more reliable approach would be to iterate over next siblings for each paragraph until we hit the next paragraph tag:

soup = BeautifulSoup(data, "html.parser")
for p in soup.find('div', attrs={'class': 'InternaTesto'}).find_all("p"):
    print(p.text)
    for sibling in p.next_siblings:
        if sibling.name == 'p':
            # reached the next paragraph: stop collecting for this one
            break
        if sibling.name == 'ul':
            print([li.text for li in sibling.find_all('li')])
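
If you want a separate list of <li> items per paragraph rather than printed output, the same sibling walk can fill a data structure. This is a minimal sketch, assuming Python 3 and html.parser; the function name `group_items_by_paragraph` and the trimmed `html` sample are hypothetical, not part of your page or of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the page in the question (an assumption for this sketch).
html = """<div class="InternaTesto">
<p class="MarginTop0">Paragraph 1</p><br>
<p><strong>Paragraph 3</strong></p><br>
<ul>
    <li><em>List item 1</em></li>
    <li><em>List item 2</em></li>
</ul>
<p><strong>Paragraph 4</strong></p><br>
<ul>
    <li><em>List item 3</em></li>
</ul>
</div>"""


def group_items_by_paragraph(container):
    """Return [(paragraph_text, [list_item_text, ...]), ...] in document order."""
    groups = []
    for p in container.find_all("p"):
        items = []
        for sibling in p.next_siblings:
            name = getattr(sibling, "name", None)  # None for text nodes
            if name == "p":
                break  # the next paragraph starts a new group
            if name == "ul":
                items.extend(li.get_text(strip=True) for li in sibling.find_all("li"))
        groups.append((p.get_text(strip=True), items))
    return groups


soup = BeautifulSoup(html, "html.parser")
groups = group_items_by_paragraph(soup.find("div", attrs={"class": "InternaTesto"}))
for paragraph, items in groups:
    print(paragraph, items)
```

A list of tuples keeps the paragraphs in page order and still works when two paragraphs have the same text, which a dict keyed on the paragraph text would not.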

Hope that helps.
