If I have a nested HTML (unordered) list that looks like this:
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
How do I form a nested dictionary out of it in Python? For example:
{
    Acorales: {
        Acoraceae: {
            Acorus: {
                Acoruscalamus: [
                    Acoruscalamusvar.americanus,
                    Acoruscalamusvar.angustatus
                ],
                Acorusgramineus
            }
        }
    }
}
I presume libraries like Beautiful Soup and HTML Parser have facilities to do this (with for loops in Python), but I haven't been able to figure it out. Thanks for any help!
I tried this way:
def create_dic(soup):
    return {li.a.get_text().replace("\xa0", ""): create_dic(li)
            for ul in soup('ul', recursive=False)
            for li in ul('li', recursive=False)}
However, the output looks like this (Acorus calamus var. americanus and Acorus calamus var. angustatus should be in a list, and Acorus gramineus should not be a dictionary):
{'Acorales': {'Acoraceae': {'Acorus': {'Acorus calamus': {'Acorus calamus var. americanus': {},
                                                          'Acorus calamus var. angustatus': {}},
                                       'Acorus gramineus': {}}}}}
I will answer here because, to make the answer from "Parsing nested HTML list with BeautifulSoup" work, you first have to parse your HTML with BeautifulSoup and pass the resulting ul element to the function. I have also flagged the question as a duplicate, so if it is one, it can simply be closed or deleted.
from bs4 import BeautifulSoup
htmlbody = '''
<ul style="">
<li class="jstree-last jstree-open" id="wfo-7000000004">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-7000000004">
<ins class="jstree-icon"> </ins>
Acoraceae
</a>
<ul style="">
<li class="jstree-last jstree-open" id="wfo-4000000350">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-4000000350">
<ins class="jstree-icon"> </ins>
Acorus
</a>
<ul style="">
<li class="jstree-open" id="wfo-0000350733">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350733">
<ins class="jstree-icon"> </ins>
Acorus calamus
</a>
<ul style="">
<li class="jstree-leaf" id="wfo-0000350841">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350841">
<ins class="jstree-icon"> </ins>
Acorus calamus var. americanus
</a>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000350949">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000350949">
<ins class="jstree-icon"> </ins>
Acorus calamus var. angustatus
</a>
</li>
</ul>
</li>
<li class="jstree-last jstree-leaf" id="wfo-0000352676">
<ins class="jstree-icon"> </ins>
<a class="" href="taxon/wfo-0000352676">
<ins class="jstree-icon"> </ins>
Acorus gramineus
</a>
</li>
</ul>
</li>
</ul>
</li>
</ul>
'''
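def ul_to_dict(ul):
    result = {}
    # Only look at the direct <li> children of this <ul>.
    for li in ul.find_all("li", recursive=False):
        # The first non-empty string inside the <li> is the link text.
        key = next(li.stripped_strings)
        # Use a separate name so the `ul` parameter is not shadowed.
        child_ul = li.find("ul")
        if child_ul:
            result[key] = ul_to_dict(child_ul)
        else:
            # Leaf item: no nested list.
            result[key] = None
    return result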
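# Let BeautifulSoup do its magic and extract the first <ul> from the HTML.
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning.
root_ul = BeautifulSoup(htmlbody, "html.parser").ul
# Run our function and print the nested dictionary.
print(ul_to_dict(root_ul))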
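The question also asks for leaf names to be collected in a list instead of being mapped to empty values. Here is a minimal sketch of one way to get closer to that shape, assuming that a ul whose items are all leaves should become a list, and that a leaf sitting next to non-leaf siblings is kept as a key with the value None (the layout in the question is not valid Python at that point, so this is only one possible reading). The names ul_to_tree, li_label and sample are made up for this example, and the HTML is a simplified stand-in for the jstree markup.

```python
from bs4 import BeautifulSoup


def li_label(li):
    # First non-empty string inside the <li>; in this markup it is the link text.
    return next(li.stripped_strings)


def ul_to_tree(ul):
    """Return a list of names if every <li> is a leaf, otherwise a dict."""
    items = ul.find_all("li", recursive=False)
    # Pair each label with its nested <ul> (None when the item is a leaf).
    children = [(li_label(li), li.find("ul")) for li in items]
    if all(child is None for _, child in children):
        return [name for name, _ in children]
    return {
        name: (ul_to_tree(child) if child is not None else None)
        for name, child in children
    }


# Simplified stand-in for the HTML in the question (assumption: same nesting).
sample = """
<ul>
  <li><a href="#">Acorus</a>
    <ul>
      <li><a href="#">Acorus calamus</a>
        <ul>
          <li><a href="#">Acorus calamus var. americanus</a></li>
          <li><a href="#">Acorus calamus var. angustatus</a></li>
        </ul>
      </li>
      <li><a href="#">Acorus gramineus</a></li>
    </ul>
  </li>
</ul>
"""

tree = ul_to_tree(BeautifulSoup(sample, "html.parser").ul)
print(tree)
```

With this sketch the two varieties end up in a list under 'Acorus calamus', while 'Acorus gramineus' stays a key with the value None because its sibling has children of its own.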