
How to group <p> tags under their preceding <h2> tags using Beautiful Soup in Python and build a dictionary from them

<h2>Summary</h2>
<p>This is summary one.</p>
<p>contains details of summary1.</p>

<h2>Software/OS</h2>
<p>windows xp</p>

<h2>HARDWARE</h2>
<p>Intel core i5</p>
<p>8 GB RAM</p>

I want to create a dictionary from the HTML above, where the keys are the <h2> heading texts and the values are the texts of the <p> tags that follow each heading.

I want the output in this format:

{"Summary":["This is summary one.","contains details of summary1."], "Software/OS": "windows xp", "HARDWARE": ["Intel core i5","8 GB RAM"]}

Can anyone help me with this? Thanks in advance.

You can use this script to make a dictionary where keys are text from <h2> and values are lists of <p> texts:

from bs4 import BeautifulSoup


txt = '''<h2>Summary</h2>
<p>This is summary one.</p>
<p>contains details of summary1.</p>

<h2>Software/OS</h2>
<p>windows xp</p>

<h2>HARDWARE</h2>
<p>Intel core i5</p>
<p>8 GB RAM</p>'''

soup = BeautifulSoup(txt, 'html.parser')

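out = {}
for p in soup.select('p'):
    # find_previous('h2') returns the nearest <h2> that precedes this <p>;
    # setdefault creates the list the first time a heading is seen.
    out.setdefault(p.find_previous('h2').text, []).append(p.text)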

print(out)

Prints:

{'Summary': ['This is summary one.', 'contains details of summary1.'], 'Software/OS': ['windows xp'], 'HARDWARE': ['Intel core i5', '8 GB RAM']}
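
One edge case worth noting: if a <p> appears before the first <h2>, find_previous('h2') returns None and the .text access raises AttributeError. The following is a minimal sketch of the same technique with a guard for that case. The function name group_paragraphs_by_heading and the sample markup are hypothetical, and skipping heading-less paragraphs is an assumption about the desired behavior:

```python
from bs4 import BeautifulSoup


def group_paragraphs_by_heading(html):
    """Map each <h2> text to the list of <p> texts that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    for p in soup.find_all("p"):
        heading = p.find_previous("h2")
        if heading is None:
            # No <h2> precedes this paragraph, so there is no key for it.
            # Assumption: such paragraphs are simply skipped.
            continue
        key = heading.get_text(strip=True)
        grouped.setdefault(key, []).append(p.get_text(strip=True))
    return grouped


# Hypothetical sample with a paragraph that has no heading above it.
sample = """<p>Intro with no heading.</p>
<h2>Summary</h2>
<p>This is summary one.</p>
<p>contains details of summary1.</p>
<h2>HARDWARE</h2>
<p>Intel core i5</p>"""

result = group_paragraphs_by_heading(sample)
print(result)
```

Using get_text(strip=True) instead of .text also trims surrounding whitespace, which can help when the real page is indented.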

If you don't want single-item lists, you can additionally unwrap them:

for k in out:
    if len(out[k]) == 1:
        out[k] = out[k][0]

print(out)

Prints:

{'Summary': ['This is summary one.', 'contains details of summary1.'], 'Software/OS': 'windows xp', 'HARDWARE': ['Intel core i5', '8 GB RAM']}
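
The unwrapping loop above can also be written as a dict comprehension that builds a new dictionary and leaves the original untouched. This is only a sketch; the helper name unwrap_single_items is hypothetical, and the input is the dictionary printed earlier:

```python
def unwrap_single_items(grouped):
    """Return a copy where one-element lists are replaced by their item."""
    return {
        key: (values[0] if len(values) == 1 else values)
        for key, values in grouped.items()
    }


out = {
    "Summary": ["This is summary one.", "contains details of summary1."],
    "Software/OS": ["windows xp"],
    "HARDWARE": ["Intel core i5", "8 GB RAM"],
}

flat = unwrap_single_items(out)
print(flat)
```

Keep in mind that mixing strings and lists as values means any code reading the dictionary has to check the type of each value, so keeping lists throughout may be simpler if the result is processed further.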
