简体   繁体   中英

How do I sort an unordered list using BeautifulSoup 4 by a child element

I am new to Python coding and BeautifulSoup4. I have a list in HTML that I need to sort, which follows the pattern:

<div id="mgioLangSelector">
<ul id="mgioLangList">
<li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="am">አማርኛ</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Amharic</span</a></li>
<li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="hr">hrvatski</span><span class="mgioAutonymSeperator"> / </span>        <span class="mgioAutonymEnglish">Croatian</span</a></li>
<li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="cs">čeština</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Czech</span</a></li>
<li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="vi">tiếng Việt</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Vietnamese</span</a></li>
<li><a href="" class="mgio-autonym"><span class="mgioAutonymNative" lang="sq">shqip</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Albanian</span</a></li>
</ul>
</div>

I need to sort the list in situ and save the resultant HTML. The list needs to be sorted by the contents of the third span, with class = mgioAutonymEnglish

I suspect that I need to use sorted() with an appropriate key function, but am coming up blank.

I have tried the following code:

from bs4 import BeautifulSoup
from lxml import etree
soup = BeautifulSoup(open("interimResults.html"), 'lxml', from_encoding="utf-8")
matches = soup.find_all("span", attrs={"class": "mgioAutonymEnglish"})
sorted(matches, key=lambda elem: elem.text)

This will sort the contents of the span, but not the lists in the original list. I assume that I need to change the lambda function, but I'm currently at a loss.

What would I need to do or change to successfully sort the list and then save those changes within the HTML document?

This is actually a little more involved than you might think, so it'll help to go through it step by step.

Let's start with your soup :

>>> soup
<html><body>
...
<div id="mgioLangSelector">
<ul id="mgioLangList">
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="am">አማርኛ</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Amharic</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="hr">hrvatski</span><span class="mgioAutonymSeperator"> / </span> <span class="mgioAutonymEnglish">Croatian</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="cs">čeština</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Czech</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="vi">tiếng Việt</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Vietnamese</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="sq">shqip</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Albanian</span></a></li>
</ul>
</div>
...
</body></html>

The first thing to do is get hold of the ul :

ul = soup.find(attrs={"id": "mgioLangList"})

Now we can extract() all the li elements from it and store them in a list:

items = [li.extract() for li in ul.find_all("li")]

To sort the list, we need a key. You were on the right lines, but actually it needs to look like this:

items.sort(key=lambda e: e.find(attrs={"class": "mgioAutonymEnglish"}).string)

Now that we have a sorted list of li elements, we can insert them back into the document. What might not be immediately obvious is that the ul didn't just contain li elements – they were separated by linebreaks, which are still there:

>>> ul
<ul id="mgioLangList">





</ul>

... and in fact, they're still six separate '\\n' strings:

>>> ul.contents
['\n', '\n', '\n', '\n', '\n', '\n']

In order to insert our sorted list of li elements back inbetween those linebreaks, we can use the builtin zip() function and the insert_after() method:

for linebreak, li in reversed(list(zip(ul.contents, items))):
    linebreak.insert_after(li)

Note that, because we're modifying ul.contents by inserting elements as we iterate over it, it's necessary to do so in reverse so that the for loop doesn't end up tripping over itself.

And so ...

>>> soup
<html><body>
...
<div id="mgioLangSelector">
<ul id="mgioLangList">
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="sq">shqip</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Albanian</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="am">አማርኛ</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Amharic</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="hr">hrvatski</span><span class="mgioAutonymSeperator"> / </span> <span class="mgioAutonymEnglish">Croatian</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="cs">čeština</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Czech</span></a></li>
<li><a class="mgio-autonym" href=""><span class="mgioAutonymNative" lang="vi">tiếng Việt</span><span class="mgioAutonymSeperator"> / </span><span class="mgioAutonymEnglish">Vietnamese</span></a></li>
</ul>
</div>
...
</body></html>

Here's the whole thing:

ul = soup.find(attrs={"id": "mgioLangList"})
items = [li.extract() for li in ul.find_all("li")]
items.sort(key=lambda e: e.find(attrs={"class": "mgioAutonymEnglish"}).string)
for linebreak, li in reversed(list(zip(ul.contents, items))):
    linebreak.insert_after(li)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM